2025-12-04T12:46:20.2008527Z Current runner version: '2.329.0'
2025-12-04T12:46:20.2011601Z Runner name: 'linux.rocm.gpu.gfx942.4.b-bphpw-runner-qmdl8'
2025-12-04T12:46:20.2012004Z Runner group name: 'default'
2025-12-04T12:46:20.2012420Z Machine name: 'linux'
2025-12-04T12:46:20.2013541Z ##[group]GITHUB_TOKEN Permissions
2025-12-04T12:46:20.2014629Z Contents: read
2025-12-04T12:46:20.2014884Z Metadata: read
2025-12-04T12:46:20.2015145Z ##[endgroup]
2025-12-04T12:46:20.2016252Z Secret source: Actions
2025-12-04T12:46:20.2016589Z Prepare workflow directory
2025-12-04T12:46:20.2258059Z Prepare all required actions
2025-12-04T12:46:20.2277697Z Getting action download info
2025-12-04T12:46:20.6896258Z Download action repository 'pytorch/pytorch@main' (SHA:a2b5dfb956aed182f6aefce1ff2eda70c35049e1)
2025-12-04T12:46:24.3072313Z Download action repository 'pytorch/test-infra@main' (SHA:39aa74d619174326f4e2fb0e216151c2f29d9ffd)
2025-12-04T12:46:25.3887279Z Download action repository 'actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02' (SHA:ea165f8d65b6e75b540449e92b4886f43607fa02)
2025-12-04T12:46:26.2858538Z Download action repository 'aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722' (SHA:ececac1a45f3b08a01d2dd070d28d111c5fe6722)
2025-12-04T12:46:27.2758408Z Getting action download info
2025-12-04T12:46:27.4637568Z Download action repository 'actions/checkout@v4' (SHA:34e114876b0b11c390a56381ad16ebd13914f8d5)
2025-12-04T12:46:28.2199994Z Getting action download info
2025-12-04T12:46:28.4132444Z Download action repository 'nick-fields/retry@v3.0.0' (SHA:7152eba30c6575329ac0576536151aca5a72780e)
2025-12-04T12:46:29.1516409Z Getting action download info
2025-12-04T12:46:29.3358717Z Uses: pytorch/pytorch/.github/workflows/_rocm-test.yml@refs/heads/main (ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32)
2025-12-04T12:46:29.3360623Z ##[group] Inputs
2025-12-04T12:46:29.3360770Z build-environment: linux-jammy-rocm-py3.10
2025-12-04T12:46:29.3361961Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}]}
2025-12-04T12:46:29.3363358Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a
2025-12-04T12:46:29.3363642Z sync-tag:
2025-12-04T12:46:29.3364021Z timeout-minutes: 300
2025-12-04T12:46:29.3364127Z tests-to-include:
2025-12-04T12:46:29.3364224Z dashboard-tag:
2025-12-04T12:46:29.3364438Z disable-monitor: true
2025-12-04T12:46:29.3364558Z monitor-log-interval: 5
2025-12-04T12:46:29.3364676Z monitor-data-collect-interval: 1
2025-12-04T12:46:29.3364800Z ##[endgroup]
2025-12-04T12:46:29.3364981Z Complete job name: linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4.b, unstable)
2025-12-04T12:46:29.3620670Z ##[group]Run pytorch/pytorch/.github/actions/checkout-pytorch@main
2025-12-04T12:46:29.3620939Z with:
2025-12-04T12:46:29.3621031Z no-sudo: true
2025-12-04T12:46:29.3621126Z submodules: recursive
2025-12-04T12:46:29.3621224Z fetch-depth: 0
2025-12-04T12:46:29.3621355Z env:
2025-12-04T12:46:29.3621588Z GIT_DEFAULT_BRANCH: main
2025-12-04T12:46:29.3621700Z ##[endgroup]
2025-12-04T12:46:29.3664154Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
2025-12-04T12:46:29.3664520Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
2025-12-04T12:46:29.3671429Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-12-04T12:46:29.3671579Z env:
2025-12-04T12:46:29.3671667Z GIT_DEFAULT_BRANCH: main
2025-12-04T12:46:29.3671768Z ##[endgroup]
2025-12-04T12:46:29.3830879Z ##[group]Run actions/checkout@v4
2025-12-04T12:46:29.3831076Z with:
2025-12-04T12:46:29.3831205Z ref: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32
2025-12-04T12:46:29.3831348Z fetch-depth: 0
2025-12-04T12:46:29.3831448Z submodules: recursive
2025-12-04T12:46:29.3831648Z show-progress: false
2025-12-04T12:46:29.3831764Z repository: pytorch/pytorch
2025-12-04T12:46:29.3831952Z token: ***
2025-12-04T12:46:29.3832052Z ssh-strict: true
2025-12-04T12:46:29.3832153Z ssh-user: git
2025-12-04T12:46:29.3832253Z persist-credentials: true
2025-12-04T12:46:29.3832366Z clean: true
2025-12-04T12:46:29.3832467Z sparse-checkout-cone-mode: true
2025-12-04T12:46:29.3832591Z fetch-tags: false
2025-12-04T12:46:29.3832689Z lfs: false
2025-12-04T12:46:29.3832783Z set-safe-directory: true
2025-12-04T12:46:29.3832890Z env:
2025-12-04T12:46:29.3832977Z GIT_DEFAULT_BRANCH: main
2025-12-04T12:46:29.3833089Z ##[endgroup]
2025-12-04T12:46:29.4528216Z Syncing repository: pytorch/pytorch
2025-12-04T12:46:29.4528831Z ##[group]Getting Git version info
2025-12-04T12:46:29.4528999Z Working directory is '/home/runner/_work/pytorch/pytorch'
2025-12-04T12:46:29.4529269Z [command]/usr/bin/git version
2025-12-04T12:46:29.4529389Z git version 2.52.0
2025-12-04T12:46:29.4529787Z ##[endgroup]
2025-12-04T12:46:29.4534590Z Copying '/home/runner/.gitconfig' to '/home/runner/_work/_temp/371441f3-1ed9-4c55-9dce-f6f0d1652ef2/.gitconfig'
2025-12-04T12:46:29.4541749Z Temporarily overriding HOME='/home/runner/_work/_temp/371441f3-1ed9-4c55-9dce-f6f0d1652ef2' before making global git config changes
2025-12-04T12:46:29.4542077Z Adding repository directory to the temporary git global config as a safe directory
2025-12-04T12:46:29.4544703Z [command]/usr/bin/git config --global --add safe.directory /home/runner/_work/pytorch/pytorch
2025-12-04T12:46:29.4576081Z [command]/usr/bin/git config --local --get remote.origin.url
2025-12-04T12:46:29.4590793Z https://github.com/pytorch/pytorch
2025-12-04T12:46:29.4602263Z ##[group]Removing previously created refs, to avoid conflicts
2025-12-04T12:46:29.4604966Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-12-04T12:46:29.4625381Z refs/heads/main
2025-12-04T12:46:29.4635931Z [command]/usr/bin/git checkout --detach
2025-12-04T12:46:31.1372091Z HEAD is now at 685ba6bc0117 add back legalize_graph for BC reason (#169541)
2025-12-04T12:46:31.1421042Z [command]/usr/bin/git branch --delete --force main
2025-12-04T12:46:31.1574992Z Deleted branch main (was 685ba6bc0117).
2025-12-04T12:46:31.1581785Z ##[endgroup]
2025-12-04T12:46:31.1584866Z [command]/usr/bin/git submodule status
2025-12-04T12:46:31.1790510Z 7e1e1fe3858c63c251c637ae41a20de425dde96f android/libs/fbjni (v0.1.0-12-g7e1e1fe)
2025-12-04T12:46:31.1844356Z 4dfe081cf6bcd15db339cf2680b9281b8451eeb3 third_party/FP16 (4dfe081)
2025-12-04T12:46:31.1893258Z b408327ac2a15ec3e43352421954f5b1967701d1 third_party/FXdiv (b408327)
2025-12-04T12:46:31.1961332Z c07e3a0400713d546e0dea2d5466dd22ea389c73 third_party/NNPACK (c07e3a0)
2025-12-04T12:46:31.1995139Z 3ebbc93ded7285963bff932c678fa367eb393ba6 third_party/NVTX (v3.1.0-313-g3ebbc93)
2025-12-04T12:46:31.2051510Z 1d8f600fd424278486eade7ed3e877c99f0846b1 third_party/VulkanMemoryAllocator (v2.1.0-982-g1d8f600)
2025-12-04T12:46:31.2360431Z 51a0103656eff6fc9bfd39a4597923c4b542c883 third_party/XNNPACK (remotes/origin/ds/ndk-1243-g51a0103656)
2025-12-04T12:46:31.2399209Z 01aae101b9e5e94d6c16a9514c9fb8df99c93150 third_party/aiter (v0.1.1-92-g01aae101)
2025-12-04T12:46:31.2416466Z 299e5928955cc62af9968370293b916f5130916f third_party/benchmark (v1.9.3)
2025-12-04T12:46:31.2468485Z 7fe50dc3da2069d6645d9deb8c017a876472a977 third_party/composable_kernel (rocm-6.4.3-459-g7fe50dc3d)
2025-12-04T12:46:31.2558301Z 89c932f313c6437c38f2982869beacc89c2f2246 third_party/cpp-httplib (v0.26.0)
2025-12-04T12:46:31.2634651Z f858c30bcb16f8effd5ff46996f0514539e17abc third_party/cpuinfo (f858c30)
2025-12-04T12:46:31.2661545Z 0b1577c8c83401237d601d0d0db5210506705396 third_party/cudnn_frontend (v0.5-61-g0b1577c)
2025-12-04T12:46:31.2721808Z f88806b1e31dfa579842638740216dd41fc6c588 third_party/cutlass (v4.3.1)
2025-12-04T12:46:31.2742651Z c0b988d39a9e47c794d699f29930ed4d7c7e13a4 third_party/fbgemm (v1.4.0-rc1-2-gc0b988d39)
2025-12-04T12:46:31.2794243Z 979702c87a8713a8e0a5e9fee122b90d2ef13be5 third_party/flash-attention (v2.7.4)
2025-12-04T12:46:31.2819142Z a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757 third_party/flatbuffers (v24.12.23)
2025-12-04T12:46:31.3066225Z 407c905e45ad75fc29bf0f9bb7c5c2fd3475976f third_party/fmt (12.1.0)
2025-12-04T12:46:31.3131559Z 3fb5c176c17c765a3492cd2f0321b0dab712f350 third_party/gemmlowp/gemmlowp (remotes/origin/revert-87-master-135-g3fb5c17)
2025-12-04T12:46:31.3214760Z 54cbae0d3a67fa890b4c3d9ee162b7860315e341 third_party/gloo (remotes/origin/gh/c-p-i-o/1/base-37-g54cbae0)
2025-12-04T12:46:31.3359540Z 52eb8108c5bdec04579160ae17225d66034bd723 third_party/googletest (release-1.8.0-3544-g52eb8108)
2025-12-04T12:46:31.3424127Z 719d8e6cd7f7a0e01b155657526d693acf97c2b3 third_party/ideep (pytorch-rls-v3.7.1)
2025-12-04T12:46:31.3467215Z dec1d23ca65ab069d225dfe40dea14f455170959 third_party/ittapi (v3.25.5)
2025-12-04T12:46:31.3621501Z 31f85df8fbd89c188f14ef10f1ec65379786b943 third_party/kineto (heads/main)
2025-12-04T12:46:31.3638983Z d7770c89632329a9914ef1a90289917597639cbe third_party/kleidiai (v1.15.0)
2025-12-04T12:46:31.3656929Z fbd8b99c2b828428947d70fdc046bb55609be93e third_party/mimalloc (v2.2.4)
2025-12-04T12:46:31.3677687Z 55f93686c01528224f448c19128836e7df245f72 third_party/nlohmann (v3.12.0)
2025-12-04T12:46:31.3871948Z e709452ef2bbc1d113faf678c24e6d3467696e83 third_party/onnx (v1.18.0)
2025-12-04T12:46:31.3887884Z a799f4aed9c94b765dcdaabaeab7d5e7e2310878 third_party/opentelemetry-cpp (v1.14.2)
2025-12-04T12:46:31.3910340Z 0fa0ef591e38c2758e3184c6c23e497b9f732ffa third_party/pocketfft (release_for_eigen-40-g0fa0ef5)
2025-12-04T12:46:31.4117588Z d1eca4e4b421cd2997495c4b4e65cea6be4e9b8a third_party/protobuf (v3.7.0-rc.2-1279-gd1eca4e4b)
2025-12-04T12:46:31.4172615Z 072586a71b55b7f8c584153d223e95687148a900 third_party/psimd (heads/master)
2025-12-04T12:46:31.4207948Z 4fe0e1e183925bf8cfa6aae24237e724a96479b8 third_party/pthreadpool (0.1-144-g4fe0e1e)
2025-12-04T12:46:31.4231622Z f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8 third_party/pybind11 (v3.0.1)
2025-12-04T12:46:31.4286797Z f45429b087dd7d5bc78bb40dc7cf06425c252d67 third_party/python-peachpy (remotes/origin/pre-generated)
2025-12-04T12:46:31.4338904Z 5a1d179df9cf652951b59010a2d2075372d67f68 third_party/sleef (3.8)
2025-12-04T12:46:31.4381590Z 2b4cd91092d335a697416b2a3cb398283246849d third_party/tensorpipe (heads/main)
2025-12-04T12:46:31.4393276Z ##[group]Cleaning the repository
2025-12-04T12:46:31.4397703Z [command]/usr/bin/git clean -ffdx
2025-12-04T12:46:31.4512848Z [command]/usr/bin/git reset --hard HEAD
2025-12-04T12:46:31.5252565Z HEAD is now at 685ba6bc0117 add back legalize_graph for BC reason (#169541)
2025-12-04T12:46:31.5335452Z ##[endgroup]
2025-12-04T12:46:31.5335702Z ##[group]Disabling automatic garbage collection
2025-12-04T12:46:31.5338858Z [command]/usr/bin/git config --local gc.auto 0
2025-12-04T12:46:31.5358539Z ##[endgroup]
2025-12-04T12:46:31.5358707Z ##[group]Setting up auth
2025-12-04T12:46:31.5361998Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-12-04T12:46:31.5382346Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-12-04T12:46:31.5569797Z Entering 'android/libs/fbjni'
2025-12-04T12:46:31.5598897Z Entering 'third_party/FP16'
2025-12-04T12:46:31.5626630Z Entering 'third_party/FXdiv'
2025-12-04T12:46:31.5655251Z Entering 'third_party/NNPACK'
2025-12-04T12:46:31.5681665Z Entering 'third_party/NVTX'
2025-12-04T12:46:31.5706139Z Entering 'third_party/VulkanMemoryAllocator'
2025-12-04T12:46:31.5726835Z Entering 'third_party/XNNPACK'
2025-12-04T12:46:31.5752399Z Entering 'third_party/aiter'
2025-12-04T12:46:31.5774215Z Entering 'third_party/aiter/3rdparty/composable_kernel'
2025-12-04T12:46:31.5801303Z Entering 'third_party/benchmark'
2025-12-04T12:46:31.5821929Z Entering 'third_party/composable_kernel'
2025-12-04T12:46:31.5850879Z Entering 'third_party/cpp-httplib'
2025-12-04T12:46:31.5874764Z Entering 'third_party/cpuinfo'
2025-12-04T12:46:31.5896497Z Entering 'third_party/cudnn_frontend'
2025-12-04T12:46:31.5916698Z Entering 'third_party/cutlass' 2025-12-04T12:46:31.5940616Z Entering 'third_party/fbgemm' 2025-12-04T12:46:31.5965112Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:31.5986823Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:31.6014310Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:31.6036072Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:31.6062683Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:31.6086511Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:31.6109077Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:31.6136629Z Entering 'third_party/flash-attention' 2025-12-04T12:46:31.6164846Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:31.6189191Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:31.6216547Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:31.6242069Z Entering 'third_party/fmt' 2025-12-04T12:46:31.6265904Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:31.6288617Z Entering 'third_party/gloo' 2025-12-04T12:46:31.6311280Z Entering 'third_party/googletest' 2025-12-04T12:46:31.6334760Z Entering 'third_party/ideep' 2025-12-04T12:46:31.6357841Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:31.6384100Z Entering 'third_party/ittapi' 2025-12-04T12:46:31.6406396Z Entering 'third_party/kineto' 2025-12-04T12:46:31.6429846Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:31.6454086Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:31.6476727Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:31.6500869Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:31.6524863Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:31.6548439Z 
Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:31.6586703Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:31.6608425Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:31.6631604Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:31.6655583Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:31.6676851Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:31.6700221Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:31.6728423Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:31.6761109Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:31.6782925Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:31.6808058Z Entering 'third_party/kleidiai' 2025-12-04T12:46:31.6831789Z Entering 'third_party/mimalloc' 2025-12-04T12:46:31.6855457Z Entering 'third_party/nlohmann' 2025-12-04T12:46:31.6886029Z Entering 'third_party/onnx' 2025-12-04T12:46:31.6923621Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:31.6957541Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:31.6981299Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:31.7012798Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:31.7038638Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:31.7064413Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:31.7089647Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:31.7114975Z Entering 
'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:31.7141307Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:31.7163866Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:31.7187948Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:31.7212583Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:31.7245241Z Entering 'third_party/pocketfft' 2025-12-04T12:46:31.7269106Z Entering 'third_party/protobuf' 2025-12-04T12:46:31.7293001Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:31.7320191Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:31.7349189Z Entering 'third_party/psimd' 2025-12-04T12:46:31.7376519Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:31.7397350Z Entering 'third_party/pybind11' 2025-12-04T12:46:31.7417362Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:31.7439135Z Entering 'third_party/sleef' 2025-12-04T12:46:31.7463046Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:31.7485827Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:31.7506876Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:31.7529129Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:31.7552306Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:31.7575082Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:31.7618266Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T12:46:31.7637310Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-12-04T12:46:31.7812855Z Entering 
'android/libs/fbjni' 2025-12-04T12:46:31.7850071Z Entering 'third_party/FP16' 2025-12-04T12:46:31.7877170Z Entering 'third_party/FXdiv' 2025-12-04T12:46:31.7902157Z Entering 'third_party/NNPACK' 2025-12-04T12:46:31.7928001Z Entering 'third_party/NVTX' 2025-12-04T12:46:31.7952942Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:31.7975211Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:31.8002090Z Entering 'third_party/aiter' 2025-12-04T12:46:31.8029183Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:31.8058383Z Entering 'third_party/benchmark' 2025-12-04T12:46:31.8082637Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:31.8116972Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:31.8148916Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:31.8176956Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:31.8201112Z Entering 'third_party/cutlass' 2025-12-04T12:46:31.8224952Z Entering 'third_party/fbgemm' 2025-12-04T12:46:31.8252037Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:31.8280217Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:31.8307719Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:31.8336547Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:31.8367162Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:31.8394013Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:31.8419407Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:31.8445643Z Entering 'third_party/flash-attention' 2025-12-04T12:46:31.8473115Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:31.8500200Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:31.8525144Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:31.8552902Z Entering 'third_party/fmt' 2025-12-04T12:46:31.8583508Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:31.8606826Z Entering 
'third_party/gloo' 2025-12-04T12:46:31.8628949Z Entering 'third_party/googletest' 2025-12-04T12:46:31.8651885Z Entering 'third_party/ideep' 2025-12-04T12:46:31.8679114Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:31.8705082Z Entering 'third_party/ittapi' 2025-12-04T12:46:31.8729248Z Entering 'third_party/kineto' 2025-12-04T12:46:31.8754425Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:31.8783772Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:31.8810318Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:31.8839176Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:31.8862420Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:31.8883882Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:31.8914560Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:31.8938422Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:31.8963533Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:31.8987864Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:31.9009089Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:31.9029874Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:31.9053693Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:31.9084398Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:31.9114435Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:31.9138559Z Entering 
'third_party/kleidiai' 2025-12-04T12:46:31.9162315Z Entering 'third_party/mimalloc' 2025-12-04T12:46:31.9185796Z Entering 'third_party/nlohmann' 2025-12-04T12:46:31.9210767Z Entering 'third_party/onnx' 2025-12-04T12:46:31.9239957Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:31.9271731Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:31.9295260Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:31.9317673Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:31.9343238Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:31.9366014Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:31.9388840Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:31.9419954Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:31.9441445Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:31.9464664Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:31.9491144Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:31.9520175Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:31.9553352Z Entering 'third_party/pocketfft' 2025-12-04T12:46:31.9578540Z Entering 'third_party/protobuf' 2025-12-04T12:46:31.9604937Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:31.9626070Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:31.9650254Z Entering 'third_party/psimd' 2025-12-04T12:46:31.9673412Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:31.9702562Z Entering 'third_party/pybind11' 2025-12-04T12:46:31.9726962Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:31.9748226Z Entering 'third_party/sleef' 2025-12-04T12:46:31.9772392Z Entering 'third_party/tensorpipe' 
2025-12-04T12:46:31.9801835Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:31.9823716Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:31.9847265Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:31.9866904Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:31.9889952Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:31.9929464Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:31.9954002Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T12:46:32.0120184Z Entering 'android/libs/fbjni' 2025-12-04T12:46:32.0131829Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T12:46:32.0140970Z Entering 'third_party/FP16' 2025-12-04T12:46:32.0153532Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T12:46:32.0162999Z Entering 'third_party/FXdiv' 2025-12-04T12:46:32.0172658Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T12:46:32.0185947Z Entering 'third_party/NNPACK' 2025-12-04T12:46:32.0197146Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T12:46:32.0207870Z Entering 'third_party/NVTX' 2025-12-04T12:46:32.0218309Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T12:46:32.0227434Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:32.0238781Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T12:46:32.0248158Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:32.0262830Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T12:46:32.0277075Z Entering 'third_party/aiter' 2025-12-04T12:46:32.0286533Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T12:46:32.0297329Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:32.0314614Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T12:46:32.0326965Z Entering 'third_party/benchmark' 2025-12-04T12:46:32.0338436Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:32.0347903Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:32.0360422Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T12:46:32.0372207Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:32.0382324Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T12:46:32.0391648Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:32.0402029Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T12:46:32.0410445Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:32.0421102Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T12:46:32.0430820Z Entering 'third_party/cutlass' 2025-12-04T12:46:32.0440150Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T12:46:32.0452532Z Entering 'third_party/fbgemm' 2025-12-04T12:46:32.0463946Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T12:46:32.0474036Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:32.0485899Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T12:46:32.0495688Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:32.0506774Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T12:46:32.0519264Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:32.0532166Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T12:46:32.0542693Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:32.0562575Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T12:46:32.0575739Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:32.0593636Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T12:46:32.0607801Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:32.0627154Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T12:46:32.0634120Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:32.0645736Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T12:46:32.0659053Z Entering 'third_party/flash-attention' 2025-12-04T12:46:32.0669260Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T12:46:32.0688442Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:32.0706546Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 
2025-12-04T12:46:32.0720638Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:32.0733154Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T12:46:32.0754210Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:32.0767993Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T12:46:32.0779653Z Entering 'third_party/fmt' 2025-12-04T12:46:32.0793326Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T12:46:32.0803051Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:32.0813867Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T12:46:32.0823284Z Entering 'third_party/gloo' 2025-12-04T12:46:32.0833260Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T12:46:32.0843564Z Entering 'third_party/googletest' 2025-12-04T12:46:32.0853516Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:32.0863326Z Entering 'third_party/ideep' 2025-12-04T12:46:32.0874793Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T12:46:32.0883450Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:32.0894054Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T12:46:32.0909019Z Entering 'third_party/ittapi' 2025-12-04T12:46:32.0921573Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T12:46:32.0930754Z Entering 'third_party/kineto' 2025-12-04T12:46:32.0943005Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T12:46:32.0952706Z Entering 
'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:32.0967930Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T12:46:32.0985717Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:32.0995439Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T12:46:32.1004758Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:32.1019754Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T12:46:32.1027418Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:32.1044358Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T12:46:32.1053564Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:32.1072986Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T12:46:32.1083029Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:32.1095211Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T12:46:32.1105268Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:32.1119011Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config 
remote.origin.url 2025-12-04T12:46:32.1128747Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:32.1147021Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:32.1161659Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:32.1175198Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T12:46:32.1185046Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:32.1194596Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T12:46:32.1203200Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:32.1213359Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T12:46:32.1222804Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:32.1236850Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T12:46:32.1248469Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:32.1263104Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T12:46:32.1277585Z Entering 
'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:32.1288526Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T12:46:32.1297393Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:32.1309158Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T12:46:32.1323702Z Entering 'third_party/kleidiai' 2025-12-04T12:46:32.1334515Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T12:46:32.1344561Z Entering 'third_party/mimalloc' 2025-12-04T12:46:32.1355348Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T12:46:32.1367574Z Entering 'third_party/nlohmann' 2025-12-04T12:46:32.1378202Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T12:46:32.1387812Z Entering 'third_party/onnx' 2025-12-04T12:46:32.1402216Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T12:46:32.1418805Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:32.1429430Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:32.1441974Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:32.1453675Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T12:46:32.1463877Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:32.1473927Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:32.1486002Z Entering 
'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:32.1497035Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:32.1505265Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:32.1516483Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T12:46:32.1524306Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:32.1534297Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T12:46:32.1543189Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:32.1553140Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T12:46:32.1565067Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:32.1574926Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T12:46:32.1583162Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:32.1593643Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T12:46:32.1602904Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:32.1621851Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T12:46:32.1633090Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 
2025-12-04T12:46:32.1644976Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T12:46:32.1655316Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:32.1665424Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T12:46:32.1686173Z Entering 'third_party/pocketfft' 2025-12-04T12:46:32.1696847Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T12:46:32.1705576Z Entering 'third_party/protobuf' 2025-12-04T12:46:32.1715654Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T12:46:32.1725860Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:32.1736419Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:32.1745949Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:32.1760724Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:32.1772803Z Entering 'third_party/psimd' 2025-12-04T12:46:32.1785970Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T12:46:32.1799167Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:32.1813870Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T12:46:32.1824683Z Entering 'third_party/pybind11' 2025-12-04T12:46:32.1835908Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:32.1846253Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:32.1856682Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T12:46:32.1866912Z Entering 'third_party/sleef' 2025-12-04T12:46:32.1876507Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T12:46:32.1885462Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:32.1899265Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T12:46:32.1908288Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:32.1926286Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:32.1935511Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:32.1947607Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T12:46:32.1960736Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:32.1971646Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T12:46:32.1982910Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:32.1993550Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:32.2005154Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:32.2014920Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T12:46:32.2044872Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2068660Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2083842Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2098645Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2112098Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2126813Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2139419Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2156599Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2170945Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2186604Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2202666Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2218412Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2231707Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2246335Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2261001Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2276492Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2292508Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2307454Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2320758Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2336914Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2351580Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2365963Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2378960Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2394371Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2409196Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2424388Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2438301Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2453627Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2473024Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2487803Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2509133Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2523758Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2539161Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2552809Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2567308Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2582508Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2597799Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2613182Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2628897Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2643222Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2658101Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2672921Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2687872Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2702588Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2715690Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2729043Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2743362Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2757734Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T12:46:32.2774247Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2793502Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2810384Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2825436Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2840341Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2855232Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2871579Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2886961Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2902348Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2917761Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2933344Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2948089Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2962635Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2977808Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.2992611Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3007100Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3021428Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3037011Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3052111Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3067291Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3088570Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3105609Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3120987Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3135485Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3150581Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3165551Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3180078Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3193044Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3207972Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3224184Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3239556Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3255045Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3278743Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:32.3298968Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T12:46:32.3341680Z ##[endgroup] 2025-12-04T12:46:32.3341865Z ##[group]Fetching the repository 2025-12-04T12:46:32.3345304Z [command]/usr/bin/git -c protocol.version=2 fetch --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/* 2025-12-04T12:46:32.9481544Z From https://github.com/pytorch/pytorch 2025-12-04T12:46:32.9481997Z - [deleted] (none) -> ciflow/trunk/169475 2025-12-04T12:46:35.4636247Z * [new branch] 2.6.0.dev20241004+ -> origin/2.6.0.dev20241004+ 2025-12-04T12:46:35.4636880Z * [new branch] 2.9.1 -> origin/2.9.1 2025-12-04T12:46:35.4637465Z * [new branch] AaronWang04_addmmfusion_perftest -> origin/AaronWang04_addmmfusion_perftest 2025-12-04T12:46:35.4638118Z * [new branch] Flamefire-patch-1 -> origin/Flamefire-patch-1 2025-12-04T12:46:35.4638702Z * [new branch] 
HDCharles-2.6.0-release-notes -> origin/HDCharles-2.6.0-release-notes 2025-12-04T12:46:35.4639259Z * [new branch] HOPrintFunc -> origin/HOPrintFunc 2025-12-04T12:46:35.4639761Z * [new branch] IvanKobzarev/stack/1 -> origin/IvanKobzarev/stack/1 2025-12-04T12:46:35.4640254Z * [new branch] NicoshevSVE128 -> origin/NicoshevSVE128 2025-12-04T12:46:35.4640761Z * [new branch] PR-AOTInductorNoneBug -> origin/PR-AOTInductorNoneBug 2025-12-04T12:46:35.4641327Z * [new branch] PR-AOTInductorNoneBugFix -> origin/PR-AOTInductorNoneBugFix 2025-12-04T12:46:35.4641874Z * [new branch] PR-FixConfigsIssue -> origin/PR-FixConfigsIssue 2025-12-04T12:46:35.4642391Z * [new branch] PR-NoneBugFix-viable -> origin/PR-NoneBugFix-viable 2025-12-04T12:46:35.4642887Z * [new branch] PR-ResetToZero -> origin/PR-ResetToZero 2025-12-04T12:46:35.4643412Z * [new branch] Update-Flash-Packaging -> origin/Update-Flash-Packaging 2025-12-04T12:46:35.4643917Z * [new branch] VLA_exp -> origin/VLA_exp 2025-12-04T12:46:35.4644374Z * [new branch] activation_bench -> origin/activation_bench 2025-12-04T12:46:35.4644859Z * [new branch] addmm-heuristic -> origin/addmm-heuristic 2025-12-04T12:46:35.4645335Z * [new branch] adi/onednn_aarch64 -> origin/adi/onednn_aarch64 2025-12-04T12:46:35.4645796Z * [new branch] adi/test -> origin/adi/test 2025-12-04T12:46:35.4646251Z * [new branch] adi/test_bgemm -> origin/adi/test_bgemm 2025-12-04T12:46:35.4646711Z * [new branch] adi/test_m8g -> origin/adi/test_m8g 2025-12-04T12:46:35.4647168Z * [new branch] adi/test_onednn -> origin/adi/test_onednn 2025-12-04T12:46:35.4647705Z * [new branch] adi/test_onednn_v3.9 -> origin/adi/test_onednn_v3.9 2025-12-04T12:46:35.4648249Z * [new branch] adi/test_presve_change -> origin/adi/test_presve_change 2025-12-04T12:46:35.4648711Z * [new branch] adi/test_timm -> origin/adi/test_timm 2025-12-04T12:46:35.4649569Z * [new branch] adi/testpresve_change -> origin/adi/testpresve_change 2025-12-04T12:46:35.4649889Z * [new branch] 
aditew01/test/vec_bf16 -> origin/aditew01/test/vec_bf16 2025-12-04T12:46:35.4650204Z * [new branch] ah-globalfeedback-hook -> origin/ah-globalfeedback-hook 2025-12-04T12:46:35.4650650Z * [new branch] albanD-patch-1 -> origin/albanD-patch-1 2025-12-04T12:46:35.4650950Z * [new branch] also-surround-shimh -> origin/also-surround-shimh 2025-12-04T12:46:35.4651254Z * [new branch] angelayi/aot_compile -> origin/angelayi/aot_compile 2025-12-04T12:46:35.4651602Z * [new branch] angelayi/aoti_additional_files -> origin/angelayi/aoti_additional_files 2025-12-04T12:46:35.4651951Z * [new branch] angelayi/benchmark -> origin/angelayi/benchmark 2025-12-04T12:46:35.4652411Z * [new branch] angelayi/change_pytree_serialization -> origin/angelayi/change_pytree_serialization 2025-12-04T12:46:35.4652784Z * [new branch] angelayi/cpp_loader -> origin/angelayi/cpp_loader 2025-12-04T12:46:35.4653099Z * [new branch] angelayi/inductor_const -> origin/angelayi/inductor_const 2025-12-04T12:46:35.4653400Z * [new branch] angelayi/lstm -> origin/angelayi/lstm 2025-12-04T12:46:35.4653694Z * [new branch] angelayi/no_so_weight -> origin/angelayi/no_so_weight 2025-12-04T12:46:35.4653999Z * [new branch] angelayi/scan_layers -> origin/angelayi/scan_layers 2025-12-04T12:46:35.4654293Z * [new branch] angelayi/side_eff -> origin/angelayi/side_eff 2025-12-04T12:46:35.4654578Z * [new branch] angelayi/state_dict -> origin/angelayi/state_dict 2025-12-04T12:46:35.4654880Z * [new branch] angelayi/symint_input -> origin/angelayi/symint_input 2025-12-04T12:46:35.4655186Z * [new branch] angelayi/symm_mem -> origin/angelayi/symm_mem 2025-12-04T12:46:35.4655471Z * [new branch] angelayi/test_cpp -> origin/angelayi/test_cpp 2025-12-04T12:46:35.4655760Z * [new branch] angelayi/torch_size -> origin/angelayi/torch_size 2025-12-04T12:46:35.4656055Z * [new branch] annotate_assert -> origin/annotate_assert 2025-12-04T12:46:35.4656359Z * [new branch] annotate_fallback_kernel -> origin/annotate_fallback_kernel 
2025-12-04T12:46:35.4656673Z * [new branch] annotation_deepcopy -> origin/annotation_deepcopy 2025-12-04T12:46:35.4656975Z * [new branch] annotation_dynamo -> origin/annotation_dynamo 2025-12-04T12:46:35.4657268Z * [new branch] aot_eager_stack_trace -> origin/aot_eager_stack_trace 2025-12-04T12:46:35.4657607Z * [new branch] aoti-cuda-alloc -> origin/aoti-cuda-alloc 2025-12-04T12:46:35.4657893Z * [new branch] aoti_const_device -> origin/aoti_const_device 2025-12-04T12:46:35.4658189Z * [new branch] aoti_fqn_name_interface -> origin/aoti_fqn_name_interface 2025-12-04T12:46:35.4658526Z * [new branch] aoti_package_weights_binary -> origin/aoti_package_weights_binary 2025-12-04T12:46:35.4658853Z * [new branch] aoti_target_windows -> origin/aoti_target_windows 2025-12-04T12:46:35.4659209Z * [new branch] arsh/feat/inductor_check_profiling -> origin/arsh/feat/inductor_check_profiling 2025-12-04T12:46:35.4659486Z * [new branch] async_tp -> origin/async_tp 2025-12-04T12:46:35.4659739Z * [new branch] atalman-inductor-perf-cu124 -> origin/atalman-inductor-perf-cu124 2025-12-04T12:46:35.4660039Z * [new branch] atalman-inductor-perf-cu124.1 -> origin/atalman-inductor-perf-cu124.1 2025-12-04T12:46:35.4660299Z * [new branch] atalman-patch-2 -> origin/atalman-patch-2 2025-12-04T12:46:35.4660568Z * [new branch] atalman-patch-3 -> origin/atalman-patch-3 2025-12-04T12:46:35.4660787Z * [new branch] atalman-patch-4 -> origin/atalman-patch-4 2025-12-04T12:46:35.4661004Z * [new branch] atalman-patch-5 -> origin/atalman-patch-5 2025-12-04T12:46:35.4661334Z * [new branch] atalman-patch-6 -> origin/atalman-patch-6 2025-12-04T12:46:35.4661545Z * [new branch] atalman-patch-7 -> origin/atalman-patch-7 2025-12-04T12:46:35.4661760Z * [new branch] atalman-patch-8 -> origin/atalman-patch-8 2025-12-04T12:46:35.4661984Z * [new branch] atalman_inductor_2.3.1 -> origin/atalman_inductor_2.3.1 2025-12-04T12:46:35.4662223Z * [new branch] atalman_inductor_2.4.0 -> origin/atalman_inductor_2.4.0 
2025-12-04T12:46:35.4662460Z * [new branch] atalman_inductor_2.4.x -> origin/atalman_inductor_2.4.x 2025-12-04T12:46:35.4662722Z * [new branch] attention_benchmarking_clean -> origin/attention_benchmarking_clean 2025-12-04T12:46:35.4662992Z * [new branch] bahuang/dt_fix_scalar_add -> origin/bahuang/dt_fix_scalar_add 2025-12-04T12:46:35.4663239Z * [new branch] bahuang/fix_debug_mode -> origin/bahuang/fix_debug_mode 2025-12-04T12:46:35.4663472Z * [new branch] bahuang/fix_expand -> origin/bahuang/fix_expand 2025-12-04T12:46:35.4663695Z * [new branch] bahuang/test -> origin/bahuang/test 2025-12-04T12:46:35.4663902Z * [new branch] base/1.5 -> origin/base/1.5 2025-12-04T12:46:35.4664152Z * [new branch] batching_sdpa_efficient_attention -> origin/batching_sdpa_efficient_attention 2025-12-04T12:46:35.4664422Z * [new branch] bench_scaled_mm_ops -> origin/bench_scaled_mm_ops 2025-12-04T12:46:35.4664658Z * [new branch] benchmark-updates -> origin/benchmark-updates 2025-12-04T12:46:35.4664894Z * [new branch] benchmarking-script -> origin/benchmarking-script 2025-12-04T12:46:35.4665131Z * [new branch] bertmaher/pinbump26 -> origin/bertmaher/pinbump26 2025-12-04T12:46:35.4665355Z * [new branch] bertrand/cutlass -> origin/bertrand/cutlass 2025-12-04T12:46:35.4665588Z * [new branch] bf/bug-static-input -> origin/bf/bug-static-input 2025-12-04T12:46:35.4665811Z * [new branch] bf/cg-backend -> origin/bf/cg-backend 2025-12-04T12:46:35.4666020Z * [new branch] bf/cg-nccl-test -> origin/bf/cg-nccl-test 2025-12-04T12:46:35.4666236Z * [new branch] bf/cg-remove-check -> origin/bf/cg-remove-check 2025-12-04T12:46:35.4666476Z * [new branch] bf/clean-torchbench-hf -> origin/bf/clean-torchbench-hf 2025-12-04T12:46:35.4666711Z * [new branch] bf/combo-debug-log -> origin/bf/combo-debug-log 2025-12-04T12:46:35.4666935Z * [new branch] bf/cudagraph -> origin/bf/cudagraph 2025-12-04T12:46:35.4667223Z * [new branch] bf/cudagraph-disable-input-mutation -> origin/bf/cudagraph-disable-input-mutation 
2025-12-04T12:46:35.4667716Z * [new branch] bf/cudagraph-enable-input-mutation-support-benchmark -> origin/bf/cudagraph-enable-input-mutation-support-benchmark 2025-12-04T12:46:35.4668101Z * [new branch] bf/cudagraph-partition -> origin/bf/cudagraph-partition 2025-12-04T12:46:35.4668351Z * [new branch] bf/donated-buffer-bench -> origin/bf/donated-buffer-bench 2025-12-04T12:46:35.4668557Z * [new branch] bf/dynamo-partition -> origin/bf/dynamo-partition 2025-12-04T12:46:35.4668730Z * [new branch] bf/lite -> origin/bf/lite 2025-12-04T12:46:35.4668902Z * [new branch] bf/pa-non-divisible -> origin/bf/pa-non-divisible 2025-12-04T12:46:35.4669627Z * [new branch] bf/partition-cache-free-symbols -> origin/bf/partition-cache-free-symbols 2025-12-04T12:46:35.4669871Z * [new branch] bf/partition-memory-plan -> origin/bf/partition-memory-plan 2025-12-04T12:46:35.4670083Z * [new branch] bf/partition-move-cpu -> origin/bf/partition-move-cpu 2025-12-04T12:46:35.4670326Z * [new branch] bf/partition-view-fallback -> origin/bf/partition-view-fallback 2025-12-04T12:46:35.4670544Z * [new branch] bf/remove-check-55b0c39d -> origin/bf/remove-check-55b0c39d 2025-12-04T12:46:35.4670743Z * [new branch] bf/timm-nov-26-2025 -> origin/bf/timm-nov-26-2025 2025-12-04T12:46:35.4670951Z * [new branch] bf/transformer-pin-4-57-3 -> origin/bf/transformer-pin-4-57-3 2025-12-04T12:46:35.4671175Z * [new branch] bisect_perf_hf_T5_3acc6eac492 -> origin/bisect_perf_hf_T5_3acc6eac492 2025-12-04T12:46:35.4671401Z * [new branch] bisect_perf_hf_T5_3fcf66f61fb -> origin/bisect_perf_hf_T5_3fcf66f61fb 2025-12-04T12:46:35.4671615Z * [new branch] bisect_perf_hf_T5_4009d154129 -> origin/bisect_perf_hf_T5_4009d154129 2025-12-04T12:46:35.4671832Z * [new branch] bisect_perf_hf_T5_40d0740e73d -> origin/bisect_perf_hf_T5_40d0740e73d 2025-12-04T12:46:35.4672042Z * [new branch] bisect_perf_hf_T5_5268754e -> origin/bisect_perf_hf_T5_5268754e 2025-12-04T12:46:35.4672250Z * [new branch] bisect_perf_hf_T5_7d89a8d385c -> 
origin/bisect_perf_hf_T5_7d89a8d385c 2025-12-04T12:46:35.4672463Z * [new branch] bisect_perf_hf_T5_b7a25c1ee7c -> origin/bisect_perf_hf_T5_b7a25c1ee7c 2025-12-04T12:46:35.4672680Z * [new branch] bisect_perf_hf_T5_c25b201583f -> origin/bisect_perf_hf_T5_c25b201583f 2025-12-04T12:46:35.4672889Z * [new branch] bisect_perf_hf_T5_c93e57efac0 -> origin/bisect_perf_hf_T5_c93e57efac0 2025-12-04T12:46:35.4673101Z * [new branch] bisect_perf_hf_T5_ca9813ea149 -> origin/bisect_perf_hf_T5_ca9813ea149 2025-12-04T12:46:35.4673308Z * [new branch] bisect_perf_hf_T5_d65f194a -> origin/bisect_perf_hf_T5_d65f194a 2025-12-04T12:46:35.4673510Z * [new branch] bisect_perf_hf_T5_da94ab0b -> origin/bisect_perf_hf_T5_da94ab0b 2025-12-04T12:46:35.4673717Z * [new branch] bisect_perf_hf_T5_da94ab0b_new -> origin/bisect_perf_hf_T5_da94ab0b_new 2025-12-04T12:46:35.4673933Z * [new branch] bisect_perf_hf_T5_db4e8a1d8a8 -> origin/bisect_perf_hf_T5_db4e8a1d8a8 2025-12-04T12:46:35.4674142Z * [new branch] bisect_perf_hf_T5_e0d97e936a2 -> origin/bisect_perf_hf_T5_e0d97e936a2 2025-12-04T12:46:35.4674354Z * [new branch] bisect_perf_hf_T5_f23621ec563 -> origin/bisect_perf_hf_T5_f23621ec563 2025-12-04T12:46:35.4674561Z * [new branch] brister/fx_device_type -> origin/brister/fx_device_type 2025-12-04T12:46:35.4674775Z * [new branch] brister/test_inductor_all_fx -> origin/brister/test_inductor_all_fx 2025-12-04T12:46:35.4675029Z * [new branch] brister/tiled_reduction_no_numel_check -> origin/brister/tiled_reduction_no_numel_check 2025-12-04T12:46:35.4675261Z * [new branch] bwd-backup -> origin/bwd-backup 2025-12-04T12:46:35.4675427Z * [new branch] c57382a49 -> origin/c57382a49 2025-12-04T12:46:35.4675591Z * [new branch] ca_0431d47eaa -> origin/ca_0431d47eaa 2025-12-04T12:46:35.4675756Z * [new branch] ca_fix_0431d47eaa -> origin/ca_fix_0431d47eaa 2025-12-04T12:46:35.4675956Z * [new branch] camyllh/test_setup_hooks_push -> origin/camyllh/test_setup_hooks_push 2025-12-04T12:46:35.4676175Z * [new branch] 
cccclai-patch-1 -> origin/cccclai-patch-1 2025-12-04T12:46:35.4676410Z * [new branch] cherry-pick-159969-by-pytorch_bot_bot_ -> origin/cherry-pick-159969-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4676710Z * [new branch] cherry-pick-160586-by-pytorch_bot_bot_ -> origin/cherry-pick-160586-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4676983Z * [new branch] cherry-pick-162208-by-pytorch_bot_bot_ -> origin/cherry-pick-162208-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4677252Z * [new branch] cherry-pick-163169-by-pytorch_bot_bot_ -> origin/cherry-pick-163169-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4677631Z * [new branch] cherry-pick-165086-by-pytorch_bot_bot_ -> origin/cherry-pick-165086-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4677902Z * [new branch] cherry-pick-165514-by-pytorch_bot_bot_ -> origin/cherry-pick-165514-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4678176Z * [new branch] cherry-pick-165601-by-pytorch_bot_bot_ -> origin/cherry-pick-165601-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4678442Z * [new branch] cherry-pick-165667-by-pytorch_bot_bot_ -> origin/cherry-pick-165667-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4678717Z * [new branch] cherry-pick-165815-by-pytorch_bot_bot_ -> origin/cherry-pick-165815-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4678984Z * [new branch] cherry-pick-165922-by-pytorch_bot_bot_ -> origin/cherry-pick-165922-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4679249Z * [new branch] cherry-pick-166148-by-pytorch_bot_bot_ -> origin/cherry-pick-166148-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4679523Z * [new branch] cherry-pick-166181-by-pytorch_bot_bot_ -> origin/cherry-pick-166181-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4679792Z * [new branch] cherry-pick-166404-by-pytorch_bot_bot_ -> origin/cherry-pick-166404-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4680056Z * [new branch] cherry-pick-166427-by-pytorch_bot_bot_ -> origin/cherry-pick-166427-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4680324Z * [new branch] cherry-pick-166480-by-pytorch_bot_bot_ -> 
origin/cherry-pick-166480-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4680598Z * [new branch] cherry-pick-166570-by-pytorch_bot_bot_ -> origin/cherry-pick-166570-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4680864Z * [new branch] cherry-pick-166993-by-pytorch_bot_bot_ -> origin/cherry-pick-166993-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4681132Z * [new branch] cherry-pick-167111-by-pytorch_bot_bot_ -> origin/cherry-pick-167111-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4681404Z * [new branch] cherry-pick-167478-by-pytorch_bot_bot_ -> origin/cherry-pick-167478-by-pytorch_bot_bot_ 2025-12-04T12:46:35.4681637Z * [new branch] cherry_pick_166036_166040 -> origin/cherry_pick_166036_166040 2025-12-04T12:46:35.4681828Z * [new branch] cherry_pick_166457 -> origin/cherry_pick_166457 2025-12-04T12:46:35.4682007Z * [new branch] cherrypick_166338 -> origin/cherrypick_166338 2025-12-04T12:46:35.4682183Z * [new branch] cherrypick_166458 -> origin/cherrypick_166458 2025-12-04T12:46:35.4682360Z * [new branch] cherrypick_166586 -> origin/cherrypick_166586 2025-12-04T12:46:35.4682536Z * [new branch] cherrypick_166956 -> origin/cherrypick_166956 2025-12-04T12:46:35.4682700Z * [new branch] ci_attn -> origin/ci_attn 2025-12-04T12:46:35.4682871Z * [new branch] codex-testing -> origin/codex-testing 2025-12-04T12:46:35.4683129Z * [new branch] codex/add-check_memory_overlap-helper-functions -> origin/codex/add-check_memory_overlap-helper-functions 2025-12-04T12:46:35.4683426Z * [new branch] codex/fix-issue-121219-in-pytorch -> origin/codex/fix-issue-121219-in-pytorch 2025-12-04T12:46:35.4683738Z * [new branch] codex/investigate-segfaults-in-get_tensor_storage_id -> origin/codex/investigate-segfaults-in-get_tensor_storage_id 2025-12-04T12:46:35.4684151Z * [new branch] codex/refactor-lintrunner-config-to-use-uv-run -> origin/codex/refactor-lintrunner-config-to-use-uv-run 2025-12-04T12:46:35.4684417Z * [new branch] compatiblpy39util -> origin/compatiblpy39util 2025-12-04T12:46:35.4684594Z * [new branch] 
cond_hop_device -> origin/cond_hop_device 2025-12-04T12:46:35.4684794Z * [new branch] context_test -> origin/context_test 2025-12-04T12:46:35.4685025Z * [new branch] copilot/code-style-cleanup-python-pip -> origin/copilot/code-style-cleanup-python-pip 2025-12-04T12:46:35.4685270Z * [new branch] cpio/fix_new_ami_tests -> origin/cpio/fix_new_ami_tests 2025-12-04T12:46:35.4685487Z * [new branch] cpp-docs-dependency-upgrade -> origin/cpp-docs-dependency-upgrade 2025-12-04T12:46:35.4685732Z * [new branch] crpa/typo-in-inductor_comm_lowering -> origin/crpa/typo-in-inductor_comm_lowering 2025-12-04T12:46:35.4685960Z * [new branch] csl/always_produce_xml -> origin/csl/always_produce_xml 2025-12-04T12:46:35.4686162Z * [new branch] csl/build_test_more_procs -> origin/csl/build_test_more_procs 2025-12-04T12:46:35.4686364Z * [new branch] csl/build_test_more_procs2 -> origin/csl/build_test_more_procs2 2025-12-04T12:46:35.4686552Z * [new branch] csl/clean_up -> origin/csl/clean_up 2025-12-04T12:46:35.4686744Z * [new branch] csl/fix_retry_segfault_exit -> origin/csl/fix_retry_segfault_exit 2025-12-04T12:46:35.4686929Z * [new branch] csl/katex -> origin/csl/katex 2025-12-04T12:46:35.4687100Z * [new branch] csl/larger_runner -> origin/csl/larger_runner 2025-12-04T12:46:35.4687274Z * [new branch] csl/lint_testing -> origin/csl/lint_testing 2025-12-04T12:46:35.4687445Z * [new branch] csl/lint_thing -> origin/csl/lint_thing 2025-12-04T12:46:35.4687668Z * [new branch] csl/lintrunner_stuff -> origin/csl/lintrunner_stuff 2025-12-04T12:46:35.4687858Z * [new branch] csl/manually_gen_json -> origin/csl/manually_gen_json 2025-12-04T12:46:35.4688040Z * [new branch] csl/mps_sharding -> origin/csl/mps_sharding 2025-12-04T12:46:35.4688221Z * [new branch] csl/multistage_docker -> origin/csl/multistage_docker 2025-12-04T12:46:35.4688405Z * [new branch] csl/print_timing -> origin/csl/print_timing 2025-12-04T12:46:35.4688584Z * [new branch] csl/remove_experiment -> origin/csl/remove_experiment 
2025-12-04T12:46:35.4688783Z * [new branch] csl/remove_maybe_unused_var -> origin/csl/remove_maybe_unused_var 2025-12-04T12:46:35.4689014Z * [new branch] csl/remove_repo_specific_autolabel -> origin/csl/remove_repo_specific_autolabel 2025-12-04T12:46:35.4689234Z * [new branch] csl/remove_run_parallel -> origin/csl/remove_run_parallel 2025-12-04T12:46:35.4689427Z * [new branch] csl/remove_unused_vars -> origin/csl/remove_unused_vars 2025-12-04T12:46:35.4689605Z * [new branch] csl/revert_open -> origin/csl/revert_open 2025-12-04T12:46:35.4689780Z * [new branch] csl/skip_build -> origin/csl/skip_build 2025-12-04T12:46:35.4689974Z * [new branch] csl/smaller_avx_amx_runenrs -> origin/csl/smaller_avx_amx_runenrs 2025-12-04T12:46:35.4690162Z * [new branch] csl/td_job_level -> origin/csl/td_job_level 2025-12-04T12:46:35.4690370Z * [new branch] csl/test_cuda_build_large_runner -> origin/csl/test_cuda_build_large_runner 2025-12-04T12:46:35.4690616Z * [new branch] csl/test_owners_autograd_dispatch_nn -> origin/csl/test_owners_autograd_dispatch_nn 2025-12-04T12:46:35.4690864Z * [new branch] csl/test_owners_higher_confidence -> origin/csl/test_owners_higher_confidence 2025-12-04T12:46:35.4691119Z * [new branch] csl/upload_json_running -> origin/csl/upload_json_running 2025-12-04T12:46:35.4691302Z * [new branch] csl/win_sccache -> origin/csl/win_sccache 2025-12-04T12:46:35.4691468Z * [new branch] csl/xml_stuff -> origin/csl/xml_stuff 2025-12-04T12:46:35.4691636Z * [new branch] cublasrelax2 -> origin/cublasrelax2 2025-12-04T12:46:35.4691844Z * [new branch] cuda_mempool -> origin/cuda_mempool 2025-12-04T12:46:35.4692018Z * [new branch] custom_lowering_dict -> origin/custom_lowering_dict 2025-12-04T12:46:35.4692214Z * [new branch] d4l3k/debug_plane_frtrace -> origin/d4l3k/debug_plane_frtrace 2025-12-04T12:46:35.4692399Z * [new branch] daxia6/2.8o3 -> origin/daxia6/2.8o3 2025-12-04T12:46:35.4692564Z * [new branch] debug-guard -> origin/debug-guard 2025-12-04T12:46:35.4692745Z * [new 
branch] delete-quant-docs -> origin/delete-quant-docs 2025-12-04T12:46:35.4693071Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 2025-12-04T12:46:35.4693516Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 2025-12-04T12:46:35.4693847Z * [new branch] desertfire/test_cpp_wrapper -> origin/desertfire/test_cpp_wrapper 2025-12-04T12:46:35.4694089Z * [new branch] desertfire/triton-cpu-for-aarch64 -> origin/desertfire/triton-cpu-for-aarch64 2025-12-04T12:46:35.4694316Z * [new branch] dev/dhruva/flex_attn_opt -> origin/dev/dhruva/flex_attn_opt 2025-12-04T12:46:35.4694532Z * [new branch] dev/joona/MPSNDArrayAdd -> origin/dev/joona/MPSNDArrayAdd 2025-12-04T12:46:35.4694725Z * [new branch] dev/joona/Unranked -> origin/dev/joona/Unranked 2025-12-04T12:46:35.4694899Z * [new branch] dev/joona/cat -> origin/dev/joona/cat 2025-12-04T12:46:35.4695085Z * [new branch] dev/joona/embeddingbag -> origin/dev/joona/embeddingbag 2025-12-04T12:46:35.4695286Z * [new branch] dev/joona/fix_sdpa_memtest -> origin/dev/joona/fix_sdpa_memtest 2025-12-04T12:46:35.4695500Z * [new branch] dev/joona/getTensorsString -> origin/dev/joona/getTensorsString 2025-12-04T12:46:35.4695718Z * [new branch] dev/joona/mps_linear_macos14 -> origin/dev/joona/mps_linear_macos14 2025-12-04T12:46:35.4695931Z * [new branch] dev/joona/scalar_clamp -> origin/dev/joona/scalar_clamp 2025-12-04T12:46:35.4696112Z * [new branch] dev/joona/sdpa -> origin/dev/joona/sdpa 2025-12-04T12:46:35.4696292Z * [new branch] dev/joona/sdpa_api -> origin/dev/joona/sdpa_api 2025-12-04T12:46:35.4696475Z * [new branch] dev/joona/type_inf -> origin/dev/joona/type_inf 2025-12-04T12:46:35.4696668Z * [new branch] dev/joona/ulpAssertClose -> origin/dev/joona/ulpAssertClose 2025-12-04T12:46:35.4696861Z * [new branch] 
dev/joona/upsize3d -> origin/dev/joona/upsize3d 2025-12-04T12:46:35.4697033Z * [new branch] disp_counter -> origin/disp_counter 2025-12-04T12:46:35.4697213Z * [new branch] divyanshk-patch-1 -> origin/divyanshk-patch-1 2025-12-04T12:46:35.4697386Z * [new branch] docs -> origin/docs 2025-12-04T12:46:35.4697610Z * [new branch] documentation -> origin/documentation 2025-12-04T12:46:35.4697791Z * [new branch] eager_model_benchmarks -> origin/eager_model_benchmarks 2025-12-04T12:46:35.4698004Z * [new branch] embg/test_inductor_ci_control -> origin/embg/test_inductor_ci_control 2025-12-04T12:46:35.4698251Z * [new branch] embg/triton_l2_prefetch_128B -> origin/embg/triton_l2_prefetch_128B 2025-12-04T12:46:35.4698470Z * [new branch] embg/triton_l2_prefetch_256B -> origin/embg/triton_l2_prefetch_256B 2025-12-04T12:46:35.4698661Z * [new branch] eqy-patch-1 -> origin/eqy-patch-1 2025-12-04T12:46:35.4698863Z * [new branch] eqy-patch-2 -> origin/eqy-patch-2 2025-12-04T12:46:35.4699027Z * [new branch] eqy-patch-3 -> origin/eqy-patch-3 2025-12-04T12:46:35.4699195Z * [new branch] eqy-patch-4 -> origin/eqy-patch-4 2025-12-04T12:46:35.4699356Z * [new branch] eqy-patch-5 -> origin/eqy-patch-5 2025-12-04T12:46:35.4699518Z * [new branch] eqy-patch-6 -> origin/eqy-patch-6 2025-12-04T12:46:35.4699695Z * [new branch] exclamaforte/amd-ma -> origin/exclamaforte/amd-ma 2025-12-04T12:46:35.4699927Z * [new branch] exclamaforte/combo-kernels-perf-run -> origin/exclamaforte/combo-kernels-perf-run 2025-12-04T12:46:35.4700179Z * [new branch] exclamaforte/do_bench_refactor -> origin/exclamaforte/do_bench_refactor 2025-12-04T12:46:35.4700428Z * [new branch] exclamaforte/enable-mem-dep-fusion -> origin/exclamaforte/enable-mem-dep-fusion 2025-12-04T12:46:35.4700707Z * [new branch] exclamaforte/fix-exhaustive-autotuning -> origin/exclamaforte/fix-exhaustive-autotuning 2025-12-04T12:46:35.4700997Z * [new branch] exclamaforte/fix-trace-parsing-fx-svg -> origin/exclamaforte/fix-trace-parsing-fx-svg 
2025-12-04T12:46:35.4701294Z * [new branch] exclamaforte/force-pointwise-cat-perf-run -> origin/exclamaforte/force-pointwise-cat-perf-run 2025-12-04T12:46:35.4701552Z * [new branch] exclamaforte/fusion-data -> origin/exclamaforte/fusion-data 2025-12-04T12:46:35.4701780Z * [new branch] exclamaforte/gemm-benchmark-run -> origin/exclamaforte/gemm-benchmark-run 2025-12-04T12:46:35.4702023Z * [new branch] exclamaforte/gemm-export-model -> origin/exclamaforte/gemm-export-model 2025-12-04T12:46:35.4702239Z * [new branch] exclamaforte/gemm-model -> origin/exclamaforte/gemm-model 2025-12-04T12:46:35.4702502Z * [new branch] exclamaforte/gemm-model-all-data-collection -> origin/exclamaforte/gemm-model-all-data-collection 2025-12-04T12:46:35.4702766Z * [new branch] exclamaforte/gemm-to-amd -> origin/exclamaforte/gemm-to-amd 2025-12-04T12:46:35.4702981Z * [new branch] exclamaforte/just-gemm-model -> origin/exclamaforte/just-gemm-model 2025-12-04T12:46:35.4703245Z * [new branch] exclamaforte/just-gemm-model-no-refactor -> origin/exclamaforte/just-gemm-model-no-refactor 2025-12-04T12:46:35.4703515Z * [new branch] exclamaforte/profile-diff-algo -> origin/exclamaforte/profile-diff-algo 2025-12-04T12:46:35.4703773Z * [new branch] exclamaforte/profiler-visualization -> origin/exclamaforte/profiler-visualization 2025-12-04T12:46:35.4704037Z * [new branch] exclamaforte/test_cpp_wrapper_mode -> origin/exclamaforte/test_cpp_wrapper_mode 2025-12-04T12:46:35.4704300Z * [new branch] exclamaforte/update-autotune-configs -> origin/exclamaforte/update-autotune-configs 2025-12-04T12:46:35.4704585Z * [new branch] exclamaforte/update-autotune-configs-2 -> origin/exclamaforte/update-autotune-configs-2 2025-12-04T12:46:35.4704810Z * [new branch] exec -> origin/exec 2025-12-04T12:46:35.4704981Z * [new branch] experimental-mosaic -> origin/experimental-mosaic 2025-12-04T12:46:35.4705164Z * [new branch] export-D61047529 -> origin/export-D61047529 2025-12-04T12:46:35.4705339Z * [new branch] 
export-D71412006 -> origin/export-D71412006 2025-12-04T12:46:35.4705608Z * [new branch] export-D73042989 -> origin/export-D73042989 2025-12-04T12:46:35.4705784Z * [new branch] export-D78957093 -> origin/export-D78957093 2025-12-04T12:46:35.4705953Z * [new branch] export-D78996107 -> origin/export-D78996107 2025-12-04T12:46:35.4706119Z * [new branch] export-D80823877 -> origin/export-D80823877 2025-12-04T12:46:35.4706311Z * [new branch] export-D80958642 -> origin/export-D80958642 2025-12-04T12:46:35.4706481Z * [new branch] export-D81054193 -> origin/export-D81054193 2025-12-04T12:46:35.4706650Z * [new branch] export-D81204584 -> origin/export-D81204584 2025-12-04T12:46:35.4706818Z * [new branch] export-D81429090 -> origin/export-D81429090 2025-12-04T12:46:35.4706992Z * [new branch] export-D82250826 -> origin/export-D82250826 2025-12-04T12:46:35.4707161Z * [new branch] export-D82253817 -> origin/export-D82253817 2025-12-04T12:46:35.4707335Z * [new branch] export-D83541846 -> origin/export-D83541846 2025-12-04T12:46:35.4707564Z * [new branch] export-D83627170 -> origin/export-D83627170 2025-12-04T12:46:35.4707732Z * [new branch] export-D83766701 -> origin/export-D83766701 2025-12-04T12:46:35.4707903Z * [new branch] export-D83768878 -> origin/export-D83768878 2025-12-04T12:46:35.4708076Z * [new branch] export-D83769447 -> origin/export-D83769447 2025-12-04T12:46:35.4708244Z * [new branch] export-D84089824 -> origin/export-D84089824 2025-12-04T12:46:35.4708413Z * [new branch] export-D84213020 -> origin/export-D84213020 2025-12-04T12:46:35.4708582Z * [new branch] export-D84373821 -> origin/export-D84373821 2025-12-04T12:46:35.4708749Z * [new branch] export-D84612194 -> origin/export-D84612194 2025-12-04T12:46:35.4708922Z * [new branch] export-D84890985 -> origin/export-D84890985 2025-12-04T12:46:35.4709090Z * [new branch] export-D85122326 -> origin/export-D85122326 2025-12-04T12:46:35.4709256Z * [new branch] export-D86256198 -> origin/export-D86256198 
2025-12-04T12:46:35.4709427Z * [new branch] export-D86460608 -> origin/export-D86460608 2025-12-04T12:46:35.4709593Z * [new branch] export-D86474796 -> origin/export-D86474796 2025-12-04T12:46:35.4709765Z * [new branch] export-D86712396 -> origin/export-D86712396 2025-12-04T12:46:35.4709933Z * [new branch] export-D87022129 -> origin/export-D87022129 2025-12-04T12:46:35.4710098Z * [new branch] export-D87838959 -> origin/export-D87838959 2025-12-04T12:46:35.4710268Z * [new branch] export-D88319437 -> origin/export-D88319437 2025-12-04T12:46:35.4710487Z * [new branch] exported-model-train-idempotent -> origin/exported-model-train-idempotent 2025-12-04T12:46:35.4710717Z * [new branch] ezyang-titan-october -> origin/ezyang-titan-october 2025-12-04T12:46:35.4710914Z * [new branch] ezyang-titan-october2 -> origin/ezyang-titan-october2 2025-12-04T12:46:35.4711099Z * [new branch] ezyang-war -> origin/ezyang-war 2025-12-04T12:46:35.4711291Z * [new branch] ezyang/wip-aot-descriptors -> origin/ezyang/wip-aot-descriptors 2025-12-04T12:46:35.4711484Z * [new branch] fa_u8_brgemm -> origin/fa_u8_brgemm 2025-12-04T12:46:35.4711672Z * [new branch] fadeputr/sequence_fbgemm -> origin/fadeputr/sequence_fbgemm 2025-12-04T12:46:35.4711860Z * [new branch] fastmath_baseline -> origin/fastmath_baseline 2025-12-04T12:46:35.4712032Z * [new branch] fbcode/warm -> origin/fbcode/warm 2025-12-04T12:46:35.4712230Z * [new branch] fca -> origin/fca 2025-12-04T12:46:35.4712387Z * [new branch] fca2_ca5984c -> origin/fca2_ca5984c 2025-12-04T12:46:35.4712546Z * [new branch] fca5 -> origin/fca5 2025-12-04T12:46:35.4712722Z * [new branch] feature/justknobs-cpp -> origin/feature/justknobs-cpp 2025-12-04T12:46:35.4712942Z * [new branch] feature/numa-forkserver -> origin/feature/numa-forkserver 2025-12-04T12:46:35.4713132Z * [new branch] ffast_math_baseline -> origin/ffast_math_baseline 2025-12-04T12:46:35.4713307Z * [new branch] ffast_math_target -> origin/ffast_math_target 2025-12-04T12:46:35.4713487Z * 
[new branch] findhao/base_commit -> origin/findhao/base_commit 2025-12-04T12:46:35.4713672Z * [new branch] findhao/base_commit1 -> origin/findhao/base_commit1 2025-12-04T12:46:35.4713858Z * [new branch] findhao/multistream2 -> origin/findhao/multistream2 2025-12-04T12:46:35.4714043Z * [new branch] findhao/multistream5 -> origin/findhao/multistream5 2025-12-04T12:46:35.4714228Z * [new branch] findhao/multistream6 -> origin/findhao/multistream6 2025-12-04T12:46:35.4714418Z * [new branch] findhao/operatorbench3 -> origin/findhao/operatorbench3 2025-12-04T12:46:35.4714623Z * [new branch] findhao/operatorbench5 -> origin/findhao/operatorbench5 2025-12-04T12:46:35.4714815Z * [new branch] findhao/tritonparse -> origin/findhao/tritonparse 2025-12-04T12:46:35.4715026Z * [new branch] fix-ck-gemm-template-format -> origin/fix-ck-gemm-template-format 2025-12-04T12:46:35.4715234Z * [new branch] fix-config-ignore -> origin/fix-config-ignore 2025-12-04T12:46:35.4715416Z * [new branch] fix-dict-guard -> origin/fix-dict-guard 2025-12-04T12:46:35.4715588Z * [new branch] fix_addmm_issue -> origin/fix_addmm_issue 2025-12-04T12:46:35.4715783Z * [new branch] fix_amd_missing_cluster_dims -> origin/fix_amd_missing_cluster_dims 2025-12-04T12:46:35.4715981Z * [new branch] fix_bench_bwd_pass -> origin/fix_bench_bwd_pass 2025-12-04T12:46:35.4716165Z * [new branch] fix_mem_profiler_config -> origin/fix_mem_profiler_config 2025-12-04T12:46:35.4716351Z * [new branch] fix_nvrtc_discovery -> origin/fix_nvrtc_discovery 2025-12-04T12:46:35.4716526Z * [new branch] fix_op_runner -> origin/fix_op_runner 2025-12-04T12:46:35.4716689Z * [new branch] fix_ubn_159469 -> origin/fix_ubn_159469 2025-12-04T12:46:35.4716856Z * [new branch] fixes-triage -> origin/fixes-triage 2025-12-04T12:46:35.4717025Z * [new branch] fixflashinfer -> origin/fixflashinfer 2025-12-04T12:46:35.4717202Z * [new branch] flash_decoding_cpu -> origin/flash_decoding_cpu 2025-12-04T12:46:35.4717375Z * [new branch] flex-flash -> 
origin/flex-flash 2025-12-04T12:46:35.4717610Z * [new branch] flex_attention_functorch_grad -> origin/flex_attention_functorch_grad 2025-12-04T12:46:35.4717805Z * [new branch] flex_flash -> origin/flex_flash 2025-12-04T12:46:35.4718005Z * [new branch] fmassa/fix_memeff_sharding_rule -> origin/fmassa/fix_memeff_sharding_rule 2025-12-04T12:46:35.4718248Z * [new branch] fmassa/tests_comm_compute_scheduler -> origin/fmassa/tests_comm_compute_scheduler 2025-12-04T12:46:35.4718462Z * [new branch] forkserver_fix -> origin/forkserver_fix 2025-12-04T12:46:35.4718634Z * [new branch] fsdp2_trace_rules -> origin/fsdp2_trace_rules 2025-12-04T12:46:35.4718800Z * [new branch] fx_cpp -> origin/fx_cpp 2025-12-04T12:46:35.4718998Z * [new branch] fy/fix-win -> origin/fy/fix-win 2025-12-04T12:46:35.4719169Z * [new branch] galv-patch-1 -> origin/galv-patch-1 2025-12-04T12:46:35.4719394Z * [new branch] galv/cudagraphs-conditional-nodes-4 -> origin/galv/cudagraphs-conditional-nodes-4 2025-12-04T12:46:35.4719675Z * [new branch] georgehong/cmakelists-patch -> origin/georgehong/cmakelists-patch 2025-12-04T12:46:35.4719885Z * [new branch] gh/AlnisM/1/base -> origin/gh/AlnisM/1/base 2025-12-04T12:46:35.4720058Z * [new branch] gh/AlnisM/1/head -> origin/gh/AlnisM/1/head 2025-12-04T12:46:35.4720238Z * [new branch] gh/EikanWang/67/base -> origin/gh/EikanWang/67/base 2025-12-04T12:46:35.4720425Z * [new branch] gh/EikanWang/67/head -> origin/gh/EikanWang/67/head 2025-12-04T12:46:35.4720607Z * [new branch] gh/Gasoonjia/1/base -> origin/gh/Gasoonjia/1/base 2025-12-04T12:46:35.4720794Z * [new branch] gh/Gasoonjia/1/head -> origin/gh/Gasoonjia/1/head 2025-12-04T12:46:35.4720975Z * [new branch] gh/H-Huang/131/base -> origin/gh/H-Huang/131/base 2025-12-04T12:46:35.4721151Z * [new branch] gh/H-Huang/131/head -> origin/gh/H-Huang/131/head 2025-12-04T12:46:35.4721328Z * [new branch] gh/H-Huang/131/orig -> origin/gh/H-Huang/131/orig 2025-12-04T12:46:35.4721503Z * [new branch] gh/H-Huang/132/base -> 
origin/gh/H-Huang/132/base 2025-12-04T12:46:35.4721677Z * [new branch] gh/H-Huang/132/head -> origin/gh/H-Huang/132/head 2025-12-04T12:46:35.4721854Z * [new branch] gh/H-Huang/132/orig -> origin/gh/H-Huang/132/orig 2025-12-04T12:46:35.4722027Z * [new branch] gh/H-Huang/180/base -> origin/gh/H-Huang/180/base 2025-12-04T12:46:35.4722204Z * [new branch] gh/H-Huang/180/head -> origin/gh/H-Huang/180/head 2025-12-04T12:46:35.4722380Z * [new branch] gh/H-Huang/180/orig -> origin/gh/H-Huang/180/orig 2025-12-04T12:46:35.4722553Z * [new branch] gh/H-Huang/182/base -> origin/gh/H-Huang/182/base 2025-12-04T12:46:35.4722728Z * [new branch] gh/H-Huang/182/head -> origin/gh/H-Huang/182/head 2025-12-04T12:46:35.4722907Z * [new branch] gh/H-Huang/182/orig -> origin/gh/H-Huang/182/orig 2025-12-04T12:46:35.4723083Z * [new branch] gh/H-Huang/226/base -> origin/gh/H-Huang/226/base 2025-12-04T12:46:35.4723261Z * [new branch] gh/H-Huang/226/head -> origin/gh/H-Huang/226/head 2025-12-04T12:46:35.4723437Z * [new branch] gh/H-Huang/226/orig -> origin/gh/H-Huang/226/orig 2025-12-04T12:46:35.4723609Z * [new branch] gh/H-Huang/228/base -> origin/gh/H-Huang/228/base 2025-12-04T12:46:35.4723784Z * [new branch] gh/H-Huang/228/head -> origin/gh/H-Huang/228/head 2025-12-04T12:46:35.4723960Z * [new branch] gh/H-Huang/228/orig -> origin/gh/H-Huang/228/orig 2025-12-04T12:46:35.4724148Z * [new branch] gh/IvanKobzarev/150/base -> origin/gh/IvanKobzarev/150/base 2025-12-04T12:46:35.4724349Z * [new branch] gh/IvanKobzarev/150/head -> origin/gh/IvanKobzarev/150/head 2025-12-04T12:46:35.4724556Z * [new branch] gh/IvanKobzarev/150/orig -> origin/gh/IvanKobzarev/150/orig 2025-12-04T12:46:35.4724751Z * [new branch] gh/IvanKobzarev/157/base -> origin/gh/IvanKobzarev/157/base 2025-12-04T12:46:35.4724948Z * [new branch] gh/IvanKobzarev/157/head -> origin/gh/IvanKobzarev/157/head 2025-12-04T12:46:35.4725145Z * [new branch] gh/IvanKobzarev/157/orig -> origin/gh/IvanKobzarev/157/orig 2025-12-04T12:46:35.4725341Z * 
[new branch] gh/IvanKobzarev/159/base -> origin/gh/IvanKobzarev/159/base 2025-12-04T12:46:35.4725562Z * [new branch] gh/IvanKobzarev/159/head -> origin/gh/IvanKobzarev/159/head 2025-12-04T12:46:35.4725759Z * [new branch] gh/IvanKobzarev/159/orig -> origin/gh/IvanKobzarev/159/orig 2025-12-04T12:46:35.4725952Z * [new branch] gh/IvanKobzarev/162/base -> origin/gh/IvanKobzarev/162/base 2025-12-04T12:46:35.4726151Z * [new branch] gh/IvanKobzarev/162/head -> origin/gh/IvanKobzarev/162/head 2025-12-04T12:46:35.4726367Z * [new branch] gh/IvanKobzarev/162/orig -> origin/gh/IvanKobzarev/162/orig 2025-12-04T12:46:35.4726563Z * [new branch] gh/IvanKobzarev/163/base -> origin/gh/IvanKobzarev/163/base 2025-12-04T12:46:35.4726758Z * [new branch] gh/IvanKobzarev/163/head -> origin/gh/IvanKobzarev/163/head 2025-12-04T12:46:35.4726952Z * [new branch] gh/IvanKobzarev/163/orig -> origin/gh/IvanKobzarev/163/orig 2025-12-04T12:46:35.4727150Z * [new branch] gh/IvanKobzarev/166/base -> origin/gh/IvanKobzarev/166/base 2025-12-04T12:46:35.4727347Z * [new branch] gh/IvanKobzarev/166/head -> origin/gh/IvanKobzarev/166/head 2025-12-04T12:46:35.4727600Z * [new branch] gh/IvanKobzarev/166/orig -> origin/gh/IvanKobzarev/166/orig 2025-12-04T12:46:35.4727798Z * [new branch] gh/IvanKobzarev/167/base -> origin/gh/IvanKobzarev/167/base 2025-12-04T12:46:35.4727999Z * [new branch] gh/IvanKobzarev/167/head -> origin/gh/IvanKobzarev/167/head 2025-12-04T12:46:35.4728193Z * [new branch] gh/IvanKobzarev/167/orig -> origin/gh/IvanKobzarev/167/orig 2025-12-04T12:46:35.4728390Z * [new branch] gh/IvanKobzarev/168/base -> origin/gh/IvanKobzarev/168/base 2025-12-04T12:46:35.4728586Z * [new branch] gh/IvanKobzarev/168/head -> origin/gh/IvanKobzarev/168/head 2025-12-04T12:46:35.4728779Z * [new branch] gh/IvanKobzarev/168/orig -> origin/gh/IvanKobzarev/168/orig 2025-12-04T12:46:35.4728981Z * [new branch] gh/IvanKobzarev/169/base -> origin/gh/IvanKobzarev/169/base 2025-12-04T12:46:35.4729176Z * [new branch] 
gh/IvanKobzarev/169/head -> origin/gh/IvanKobzarev/169/head 2025-12-04T12:46:35.4729370Z * [new branch] gh/IvanKobzarev/169/orig -> origin/gh/IvanKobzarev/169/orig 2025-12-04T12:46:35.4729565Z * [new branch] gh/IvanKobzarev/170/base -> origin/gh/IvanKobzarev/170/base 2025-12-04T12:46:35.4729764Z * [new branch] gh/IvanKobzarev/170/head -> origin/gh/IvanKobzarev/170/head 2025-12-04T12:46:35.4729956Z * [new branch] gh/IvanKobzarev/170/orig -> origin/gh/IvanKobzarev/170/orig 2025-12-04T12:46:35.4730154Z * [new branch] gh/IvanKobzarev/171/base -> origin/gh/IvanKobzarev/171/base 2025-12-04T12:46:35.4730353Z * [new branch] gh/IvanKobzarev/171/head -> origin/gh/IvanKobzarev/171/head 2025-12-04T12:46:35.4730547Z * [new branch] gh/IvanKobzarev/171/orig -> origin/gh/IvanKobzarev/171/orig 2025-12-04T12:46:35.4730744Z * [new branch] gh/IvanKobzarev/172/base -> origin/gh/IvanKobzarev/172/base 2025-12-04T12:46:35.4730941Z * [new branch] gh/IvanKobzarev/172/head -> origin/gh/IvanKobzarev/172/head 2025-12-04T12:46:35.4731136Z * [new branch] gh/IvanKobzarev/172/orig -> origin/gh/IvanKobzarev/172/orig 2025-12-04T12:46:35.4731334Z * [new branch] gh/IvanKobzarev/173/base -> origin/gh/IvanKobzarev/173/base 2025-12-04T12:46:35.4731531Z * [new branch] gh/IvanKobzarev/173/head -> origin/gh/IvanKobzarev/173/head 2025-12-04T12:46:35.4731725Z * [new branch] gh/IvanKobzarev/173/orig -> origin/gh/IvanKobzarev/173/orig 2025-12-04T12:46:35.4731922Z * [new branch] gh/IvanKobzarev/174/base -> origin/gh/IvanKobzarev/174/base 2025-12-04T12:46:35.4732118Z * [new branch] gh/IvanKobzarev/174/head -> origin/gh/IvanKobzarev/174/head 2025-12-04T12:46:35.4732361Z * [new branch] gh/IvanKobzarev/174/orig -> origin/gh/IvanKobzarev/174/orig 2025-12-04T12:46:35.4732560Z * [new branch] gh/IvanKobzarev/175/base -> origin/gh/IvanKobzarev/175/base 2025-12-04T12:46:35.4732756Z * [new branch] gh/IvanKobzarev/175/head -> origin/gh/IvanKobzarev/175/head 2025-12-04T12:46:35.4732951Z * [new branch] 
gh/IvanKobzarev/175/orig -> origin/gh/IvanKobzarev/175/orig 2025-12-04T12:46:35.4733174Z * [new branch] gh/IvanKobzarev/176/base -> origin/gh/IvanKobzarev/176/base 2025-12-04T12:46:35.4733367Z * [new branch] gh/IvanKobzarev/176/head -> origin/gh/IvanKobzarev/176/head 2025-12-04T12:46:35.4733566Z * [new branch] gh/IvanKobzarev/176/orig -> origin/gh/IvanKobzarev/176/orig 2025-12-04T12:46:35.4733764Z * [new branch] gh/IvanKobzarev/177/base -> origin/gh/IvanKobzarev/177/base 2025-12-04T12:46:35.4733957Z * [new branch] gh/IvanKobzarev/177/head -> origin/gh/IvanKobzarev/177/head 2025-12-04T12:46:35.4734155Z * [new branch] gh/IvanKobzarev/177/orig -> origin/gh/IvanKobzarev/177/orig 2025-12-04T12:46:35.4734351Z * [new branch] gh/IvanKobzarev/178/base -> origin/gh/IvanKobzarev/178/base 2025-12-04T12:46:35.4734547Z * [new branch] gh/IvanKobzarev/178/head -> origin/gh/IvanKobzarev/178/head 2025-12-04T12:46:35.4734744Z * [new branch] gh/IvanKobzarev/178/orig -> origin/gh/IvanKobzarev/178/orig 2025-12-04T12:46:35.4734939Z * [new branch] gh/IvanKobzarev/179/base -> origin/gh/IvanKobzarev/179/base 2025-12-04T12:46:35.4735132Z * [new branch] gh/IvanKobzarev/179/head -> origin/gh/IvanKobzarev/179/head 2025-12-04T12:46:35.4735329Z * [new branch] gh/IvanKobzarev/179/orig -> origin/gh/IvanKobzarev/179/orig 2025-12-04T12:46:35.4735524Z * [new branch] gh/IvanKobzarev/180/base -> origin/gh/IvanKobzarev/180/base 2025-12-04T12:46:35.4735724Z * [new branch] gh/IvanKobzarev/180/head -> origin/gh/IvanKobzarev/180/head 2025-12-04T12:46:35.4735919Z * [new branch] gh/IvanKobzarev/180/orig -> origin/gh/IvanKobzarev/180/orig 2025-12-04T12:46:35.4736114Z * [new branch] gh/IvanKobzarev/181/base -> origin/gh/IvanKobzarev/181/base 2025-12-04T12:46:35.4736308Z * [new branch] gh/IvanKobzarev/181/head -> origin/gh/IvanKobzarev/181/head 2025-12-04T12:46:35.4736733Z * [new branch] gh/IvanKobzarev/181/orig -> origin/gh/IvanKobzarev/181/orig 2025-12-04T12:46:35.4736927Z * [new branch] 
gh/IvanKobzarev/182/base -> origin/gh/IvanKobzarev/182/base 2025-12-04T12:46:35.4737121Z * [new branch] gh/IvanKobzarev/182/head -> origin/gh/IvanKobzarev/182/head 2025-12-04T12:46:35.4737317Z * [new branch] gh/IvanKobzarev/182/orig -> origin/gh/IvanKobzarev/182/orig 2025-12-04T12:46:35.4737575Z * [new branch] gh/IvanKobzarev/183/base -> origin/gh/IvanKobzarev/183/base 2025-12-04T12:46:35.4737773Z * [new branch] gh/IvanKobzarev/183/head -> origin/gh/IvanKobzarev/183/head 2025-12-04T12:46:35.4737971Z * [new branch] gh/IvanKobzarev/183/orig -> origin/gh/IvanKobzarev/183/orig 2025-12-04T12:46:35.4738164Z * [new branch] gh/IvanKobzarev/184/base -> origin/gh/IvanKobzarev/184/base 2025-12-04T12:46:35.4738364Z * [new branch] gh/IvanKobzarev/184/head -> origin/gh/IvanKobzarev/184/head 2025-12-04T12:46:35.4738559Z * [new branch] gh/IvanKobzarev/184/orig -> origin/gh/IvanKobzarev/184/orig 2025-12-04T12:46:35.4738756Z * [new branch] gh/NikhilAPatel/1/base -> origin/gh/NikhilAPatel/1/base 2025-12-04T12:46:35.4738952Z * [new branch] gh/NikhilAPatel/1/head -> origin/gh/NikhilAPatel/1/head 2025-12-04T12:46:35.4739152Z * [new branch] gh/NikhilAPatel/2/base -> origin/gh/NikhilAPatel/2/base 2025-12-04T12:46:35.4739341Z * [new branch] gh/NikhilAPatel/2/head -> origin/gh/NikhilAPatel/2/head 2025-12-04T12:46:35.4739570Z * [new branch] gh/NikhilAPatel/4/base -> origin/gh/NikhilAPatel/4/base 2025-12-04T12:46:35.4739768Z * [new branch] gh/NikhilAPatel/4/head -> origin/gh/NikhilAPatel/4/head 2025-12-04T12:46:35.4739956Z * [new branch] gh/NikhilAPatel/5/base -> origin/gh/NikhilAPatel/5/base 2025-12-04T12:46:35.4740180Z * [new branch] gh/NikhilAPatel/5/head -> origin/gh/NikhilAPatel/5/head 2025-12-04T12:46:35.4740373Z * [new branch] gh/NikhilAPatel/5/orig -> origin/gh/NikhilAPatel/5/orig 2025-12-04T12:46:35.4740560Z * [new branch] gh/PaliC/17/base -> origin/gh/PaliC/17/base 2025-12-04T12:46:35.4740738Z * [new branch] gh/PaliC/17/head -> origin/gh/PaliC/17/head 2025-12-04T12:46:35.4740914Z * 
[new branch] gh/PaliC/17/orig -> origin/gh/PaliC/17/orig 2025-12-04T12:46:35.4741084Z * [new branch] gh/PaliC/18/base -> origin/gh/PaliC/18/base 2025-12-04T12:46:35.4741260Z * [new branch] gh/PaliC/18/head -> origin/gh/PaliC/18/head 2025-12-04T12:46:35.4741430Z * [new branch] gh/PaliC/18/orig -> origin/gh/PaliC/18/orig 2025-12-04T12:46:35.4741598Z * [new branch] gh/PaliC/20/base -> origin/gh/PaliC/20/base 2025-12-04T12:46:35.4741773Z * [new branch] gh/PaliC/20/head -> origin/gh/PaliC/20/head 2025-12-04T12:46:35.4741940Z * [new branch] gh/PaliC/20/orig -> origin/gh/PaliC/20/orig 2025-12-04T12:46:35.4742109Z * [new branch] gh/PaliC/21/base -> origin/gh/PaliC/21/base 2025-12-04T12:46:35.4742281Z * [new branch] gh/PaliC/21/head -> origin/gh/PaliC/21/head 2025-12-04T12:46:35.4742450Z * [new branch] gh/PaliC/21/orig -> origin/gh/PaliC/21/orig 2025-12-04T12:46:35.4742624Z * [new branch] gh/PaliC/23/base -> origin/gh/PaliC/23/base 2025-12-04T12:46:35.4742798Z * [new branch] gh/PaliC/23/head -> origin/gh/PaliC/23/head 2025-12-04T12:46:35.4742966Z * [new branch] gh/PaliC/23/orig -> origin/gh/PaliC/23/orig 2025-12-04T12:46:35.4743136Z * [new branch] gh/PaliC/24/base -> origin/gh/PaliC/24/base 2025-12-04T12:46:35.4743309Z * [new branch] gh/PaliC/24/head -> origin/gh/PaliC/24/head 2025-12-04T12:46:35.4743476Z * [new branch] gh/PaliC/24/orig -> origin/gh/PaliC/24/orig 2025-12-04T12:46:35.4743648Z * [new branch] gh/PaliC/25/head -> origin/gh/PaliC/25/head 2025-12-04T12:46:35.4743818Z * [new branch] gh/PaliC/25/next -> origin/gh/PaliC/25/next 2025-12-04T12:46:35.4743986Z * [new branch] gh/PaliC/25/orig -> origin/gh/PaliC/25/orig 2025-12-04T12:46:35.4744159Z * [new branch] gh/PaliC/26/head -> origin/gh/PaliC/26/head 2025-12-04T12:46:35.4744331Z * [new branch] gh/PaliC/26/next -> origin/gh/PaliC/26/next 2025-12-04T12:46:35.4744499Z * [new branch] gh/PaliC/26/orig -> origin/gh/PaliC/26/orig 2025-12-04T12:46:35.4744669Z * [new branch] gh/PaliC/27/next -> origin/gh/PaliC/27/next 
2025-12-04T12:46:35.4744842Z * [new branch] gh/PaliC/28/head -> origin/gh/PaliC/28/head 2025-12-04T12:46:35.4745013Z * [new branch] gh/PaliC/28/next -> origin/gh/PaliC/28/next 2025-12-04T12:46:35.4745186Z * [new branch] gh/PaliC/28/orig -> origin/gh/PaliC/28/orig 2025-12-04T12:46:35.4745353Z * [new branch] gh/PaliC/29/head -> origin/gh/PaliC/29/head 2025-12-04T12:46:35.4745524Z * [new branch] gh/PaliC/29/next -> origin/gh/PaliC/29/next 2025-12-04T12:46:35.4745694Z * [new branch] gh/PaliC/29/orig -> origin/gh/PaliC/29/orig 2025-12-04T12:46:35.4745889Z * [new branch] gh/PaliC/30/head -> origin/gh/PaliC/30/head 2025-12-04T12:46:35.4746063Z * [new branch] gh/PaliC/30/next -> origin/gh/PaliC/30/next 2025-12-04T12:46:35.4746235Z * [new branch] gh/PaliC/30/orig -> origin/gh/PaliC/30/orig 2025-12-04T12:46:35.4746402Z * [new branch] gh/PaliC/31/head -> origin/gh/PaliC/31/head 2025-12-04T12:46:35.4746595Z * [new branch] gh/PaliC/31/next -> origin/gh/PaliC/31/next 2025-12-04T12:46:35.4746767Z * [new branch] gh/PaliC/31/orig -> origin/gh/PaliC/31/orig 2025-12-04T12:46:35.4746950Z * [new branch] gh/PaulZhang12/25/base -> origin/gh/PaulZhang12/25/base 2025-12-04T12:46:35.4747141Z * [new branch] gh/PaulZhang12/25/head -> origin/gh/PaulZhang12/25/head 2025-12-04T12:46:35.4747329Z * [new branch] gh/PaulZhang12/25/orig -> origin/gh/PaulZhang12/25/orig 2025-12-04T12:46:35.4747563Z * [new branch] gh/PaulZhang12/28/base -> origin/gh/PaulZhang12/28/base 2025-12-04T12:46:35.4747756Z * [new branch] gh/PaulZhang12/28/head -> origin/gh/PaulZhang12/28/head 2025-12-04T12:46:35.4747951Z * [new branch] gh/PaulZhang12/28/orig -> origin/gh/PaulZhang12/28/orig 2025-12-04T12:46:35.4748141Z * [new branch] gh/PaulZhang12/31/base -> origin/gh/PaulZhang12/31/base 2025-12-04T12:46:35.4748338Z * [new branch] gh/PaulZhang12/31/head -> origin/gh/PaulZhang12/31/head 2025-12-04T12:46:35.4748534Z * [new branch] gh/PaulZhang12/31/orig -> origin/gh/PaulZhang12/31/orig 2025-12-04T12:46:35.4748723Z * [new branch] 
gh/PaulZhang12/37/base -> origin/gh/PaulZhang12/37/base 2025-12-04T12:46:35.4748919Z * [new branch] gh/PaulZhang12/37/head -> origin/gh/PaulZhang12/37/head 2025-12-04T12:46:35.4749112Z * [new branch] gh/PaulZhang12/37/orig -> origin/gh/PaulZhang12/37/orig 2025-12-04T12:46:35.4749308Z * [new branch] gh/PaulZhang12/40/base -> origin/gh/PaulZhang12/40/base 2025-12-04T12:46:35.4749502Z * [new branch] gh/PaulZhang12/40/head -> origin/gh/PaulZhang12/40/head 2025-12-04T12:46:35.4749691Z * [new branch] gh/PaulZhang12/40/orig -> origin/gh/PaulZhang12/40/orig 2025-12-04T12:46:35.4749886Z * [new branch] gh/PaulZhang12/42/base -> origin/gh/PaulZhang12/42/base 2025-12-04T12:46:35.4750189Z * [new branch] gh/PaulZhang12/42/head -> origin/gh/PaulZhang12/42/head 2025-12-04T12:46:35.4750379Z * [new branch] gh/PaulZhang12/43/base -> origin/gh/PaulZhang12/43/base 2025-12-04T12:46:35.4750575Z * [new branch] gh/PaulZhang12/43/head -> origin/gh/PaulZhang12/43/head 2025-12-04T12:46:35.4750770Z * [new branch] gh/PaulZhang12/43/orig -> origin/gh/PaulZhang12/43/orig 2025-12-04T12:46:35.4750959Z * [new branch] gh/PaulZhang12/44/base -> origin/gh/PaulZhang12/44/base 2025-12-04T12:46:35.4751158Z * [new branch] gh/PaulZhang12/44/head -> origin/gh/PaulZhang12/44/head 2025-12-04T12:46:35.4751388Z * [new branch] gh/PaulZhang12/45/base -> origin/gh/PaulZhang12/45/base 2025-12-04T12:46:35.4751601Z * [new branch] gh/PaulZhang12/45/head -> origin/gh/PaulZhang12/45/head 2025-12-04T12:46:35.4751825Z * [new branch] gh/PaulZhang12/45/orig -> origin/gh/PaulZhang12/45/orig 2025-12-04T12:46:35.4752441Z * [new branch] gh/PaulZhang12/46/base -> origin/gh/PaulZhang12/46/base 2025-12-04T12:46:35.4752665Z * [new branch] gh/PaulZhang12/46/head -> origin/gh/PaulZhang12/46/head 2025-12-04T12:46:35.4752879Z * [new branch] gh/PaulZhang12/46/orig -> origin/gh/PaulZhang12/46/orig 2025-12-04T12:46:35.4753117Z * [new branch] gh/PaulZhang12/47/base -> origin/gh/PaulZhang12/47/base 2025-12-04T12:46:35.4753378Z * [new branch] 
gh/PaulZhang12/47/head -> origin/gh/PaulZhang12/47/head 2025-12-04T12:46:35.4753592Z * [new branch] gh/PaulZhang12/47/orig -> origin/gh/PaulZhang12/47/orig 2025-12-04T12:46:35.4753829Z * [new branch] gh/PaulZhang12/48/base -> origin/gh/PaulZhang12/48/base 2025-12-04T12:46:35.4754051Z * [new branch] gh/PaulZhang12/48/head -> origin/gh/PaulZhang12/48/head 2025-12-04T12:46:35.4754303Z * [new branch] gh/PaulZhang12/48/orig -> origin/gh/PaulZhang12/48/orig 2025-12-04T12:46:35.4754528Z * [new branch] gh/SamGinzburg/11/base -> origin/gh/SamGinzburg/11/base 2025-12-04T12:46:35.4754748Z * [new branch] gh/SamGinzburg/11/head -> origin/gh/SamGinzburg/11/head 2025-12-04T12:46:35.4754974Z * [new branch] gh/SherlockNoMad/1/base -> origin/gh/SherlockNoMad/1/base 2025-12-04T12:46:35.4755211Z * [new branch] gh/SherlockNoMad/1/head -> origin/gh/SherlockNoMad/1/head 2025-12-04T12:46:35.4755445Z * [new branch] gh/SherlockNoMad/10/base -> origin/gh/SherlockNoMad/10/base 2025-12-04T12:46:35.4755677Z * [new branch] gh/SherlockNoMad/10/head -> origin/gh/SherlockNoMad/10/head 2025-12-04T12:46:35.4755911Z * [new branch] gh/SherlockNoMad/10/orig -> origin/gh/SherlockNoMad/10/orig 2025-12-04T12:46:35.4756143Z * [new branch] gh/SherlockNoMad/11/base -> origin/gh/SherlockNoMad/11/base 2025-12-04T12:46:35.4756372Z * [new branch] gh/SherlockNoMad/11/head -> origin/gh/SherlockNoMad/11/head 2025-12-04T12:46:35.4756608Z * [new branch] gh/SherlockNoMad/11/orig -> origin/gh/SherlockNoMad/11/orig 2025-12-04T12:46:35.4756853Z * [new branch] gh/SherlockNoMad/12/base -> origin/gh/SherlockNoMad/12/base 2025-12-04T12:46:35.4757081Z * [new branch] gh/SherlockNoMad/12/head -> origin/gh/SherlockNoMad/12/head 2025-12-04T12:46:35.4757316Z * [new branch] gh/SherlockNoMad/12/orig -> origin/gh/SherlockNoMad/12/orig 2025-12-04T12:46:35.4757635Z * [new branch] gh/SherlockNoMad/15/base -> origin/gh/SherlockNoMad/15/base 2025-12-04T12:46:35.4758163Z * [new branch] gh/SherlockNoMad/15/head -> 
origin/gh/SherlockNoMad/15/head 2025-12-04T12:46:35.4758395Z * [new branch] gh/SherlockNoMad/15/orig -> origin/gh/SherlockNoMad/15/orig 2025-12-04T12:46:35.4781789Z * [new branch] gh/SherlockNoMad/17/base -> origin/gh/SherlockNoMad/17/base 2025-12-04T12:46:35.4782025Z * [new branch] gh/SherlockNoMad/17/head -> origin/gh/SherlockNoMad/17/head 2025-12-04T12:46:35.4782246Z * [new branch] gh/SherlockNoMad/17/orig -> origin/gh/SherlockNoMad/17/orig 2025-12-04T12:46:35.4782473Z * [new branch] gh/SherlockNoMad/18/base -> origin/gh/SherlockNoMad/18/base 2025-12-04T12:46:35.4782709Z * [new branch] gh/SherlockNoMad/18/head -> origin/gh/SherlockNoMad/18/head 2025-12-04T12:46:35.4782945Z * [new branch] gh/SherlockNoMad/18/orig -> origin/gh/SherlockNoMad/18/orig 2025-12-04T12:46:35.4783157Z * [new branch] gh/SherlockNoMad/19/base -> origin/gh/SherlockNoMad/19/base 2025-12-04T12:46:35.4783375Z * [new branch] gh/SherlockNoMad/19/head -> origin/gh/SherlockNoMad/19/head 2025-12-04T12:46:35.4783592Z * [new branch] gh/SherlockNoMad/19/orig -> origin/gh/SherlockNoMad/19/orig 2025-12-04T12:46:35.4783818Z * [new branch] gh/SherlockNoMad/2/base -> origin/gh/SherlockNoMad/2/base 2025-12-04T12:46:35.4784057Z * [new branch] gh/SherlockNoMad/2/head -> origin/gh/SherlockNoMad/2/head 2025-12-04T12:46:35.4784285Z * [new branch] gh/SherlockNoMad/20/base -> origin/gh/SherlockNoMad/20/base 2025-12-04T12:46:35.4784494Z * [new branch] gh/SherlockNoMad/20/head -> origin/gh/SherlockNoMad/20/head 2025-12-04T12:46:35.4784694Z * [new branch] gh/SherlockNoMad/20/orig -> origin/gh/SherlockNoMad/20/orig 2025-12-04T12:46:35.4784967Z * [new branch] gh/SherlockNoMad/21/base -> origin/gh/SherlockNoMad/21/base 2025-12-04T12:46:35.4785175Z * [new branch] gh/SherlockNoMad/21/head -> origin/gh/SherlockNoMad/21/head 2025-12-04T12:46:35.4785378Z * [new branch] gh/SherlockNoMad/21/orig -> origin/gh/SherlockNoMad/21/orig 2025-12-04T12:46:35.4785625Z * [new branch] gh/SherlockNoMad/3/base -> 
origin/gh/SherlockNoMad/3/base 2025-12-04T12:46:35.4785826Z * [new branch] gh/SherlockNoMad/3/head -> origin/gh/SherlockNoMad/3/head 2025-12-04T12:46:35.4786017Z * [new branch] gh/SherlockNoMad/4/base -> origin/gh/SherlockNoMad/4/base 2025-12-04T12:46:35.4786213Z * [new branch] gh/SherlockNoMad/4/head -> origin/gh/SherlockNoMad/4/head 2025-12-04T12:46:35.4786410Z * [new branch] gh/SherlockNoMad/5/base -> origin/gh/SherlockNoMad/5/base 2025-12-04T12:46:35.4786606Z * [new branch] gh/SherlockNoMad/5/head -> origin/gh/SherlockNoMad/5/head 2025-12-04T12:46:35.4786818Z * [new branch] gh/Sidharth123-cpu/24/base -> origin/gh/Sidharth123-cpu/24/base 2025-12-04T12:46:35.4787040Z * [new branch] gh/Sidharth123-cpu/25/base -> origin/gh/Sidharth123-cpu/25/base 2025-12-04T12:46:35.4787251Z * [new branch] gh/Sidharth123-cpu/26/base -> origin/gh/Sidharth123-cpu/26/base 2025-12-04T12:46:35.4787465Z * [new branch] gh/Sidharth123-cpu/27/base -> origin/gh/Sidharth123-cpu/27/base 2025-12-04T12:46:35.4787733Z * [new branch] gh/StrongerXi/1/base -> origin/gh/StrongerXi/1/base 2025-12-04T12:46:35.4787921Z * [new branch] gh/StrongerXi/1/head -> origin/gh/StrongerXi/1/head 2025-12-04T12:46:35.4788115Z * [new branch] gh/StrongerXi/71/base -> origin/gh/StrongerXi/71/base 2025-12-04T12:46:35.4788310Z * [new branch] gh/StrongerXi/71/head -> origin/gh/StrongerXi/71/head 2025-12-04T12:46:35.4788506Z * [new branch] gh/StrongerXi/72/base -> origin/gh/StrongerXi/72/base 2025-12-04T12:46:35.4788701Z * [new branch] gh/StrongerXi/72/head -> origin/gh/StrongerXi/72/head 2025-12-04T12:46:35.4788895Z * [new branch] gh/StrongerXi/73/base -> origin/gh/StrongerXi/73/base 2025-12-04T12:46:35.4789080Z * [new branch] gh/StrongerXi/73/head -> origin/gh/StrongerXi/73/head 2025-12-04T12:46:35.4789280Z * [new branch] gh/StrongerXi/73/orig -> origin/gh/StrongerXi/73/orig 2025-12-04T12:46:35.4789478Z * [new branch] gh/XilunWu/160/base -> origin/gh/XilunWu/160/base 2025-12-04T12:46:35.4789665Z * [new branch] 
gh/XilunWu/160/head -> origin/gh/XilunWu/160/head 2025-12-04T12:46:35.4789850Z * [new branch] gh/XilunWu/160/orig -> origin/gh/XilunWu/160/orig 2025-12-04T12:46:35.4790033Z * [new branch] gh/XilunWu/163/base -> origin/gh/XilunWu/163/base 2025-12-04T12:46:35.4790217Z * [new branch] gh/XilunWu/163/head -> origin/gh/XilunWu/163/head 2025-12-04T12:46:35.4790421Z * [new branch] gh/XilunWu/163/orig -> origin/gh/XilunWu/163/orig 2025-12-04T12:46:35.4790611Z * [new branch] gh/XilunWu/168/base -> origin/gh/XilunWu/168/base 2025-12-04T12:46:35.4790801Z * [new branch] gh/XilunWu/168/head -> origin/gh/XilunWu/168/head 2025-12-04T12:46:35.4791010Z * [new branch] gh/XilunWu/168/orig -> origin/gh/XilunWu/168/orig 2025-12-04T12:46:35.4791192Z * [new branch] gh/XilunWu/169/base -> origin/gh/XilunWu/169/base 2025-12-04T12:46:35.4791383Z * [new branch] gh/XilunWu/169/head -> origin/gh/XilunWu/169/head 2025-12-04T12:46:35.4791569Z * [new branch] gh/XilunWu/169/orig -> origin/gh/XilunWu/169/orig 2025-12-04T12:46:35.4791746Z * [new branch] gh/XilunWu/170/base -> origin/gh/XilunWu/170/base 2025-12-04T12:46:35.4791984Z * [new branch] gh/XilunWu/170/head -> origin/gh/XilunWu/170/head 2025-12-04T12:46:35.4792165Z * [new branch] gh/XilunWu/170/orig -> origin/gh/XilunWu/170/orig 2025-12-04T12:46:35.4792344Z * [new branch] gh/XilunWu/171/base -> origin/gh/XilunWu/171/base 2025-12-04T12:46:35.4792552Z * [new branch] gh/XilunWu/171/head -> origin/gh/XilunWu/171/head 2025-12-04T12:46:35.4792741Z * [new branch] gh/XilunWu/171/orig -> origin/gh/XilunWu/171/orig 2025-12-04T12:46:35.4792923Z * [new branch] gh/XilunWu/173/base -> origin/gh/XilunWu/173/base 2025-12-04T12:46:35.4793111Z * [new branch] gh/XilunWu/173/head -> origin/gh/XilunWu/173/head 2025-12-04T12:46:35.4793292Z * [new branch] gh/XilunWu/173/orig -> origin/gh/XilunWu/173/orig 2025-12-04T12:46:35.4793475Z * [new branch] gh/XilunWu/175/base -> origin/gh/XilunWu/175/base 2025-12-04T12:46:35.4793666Z * [new branch] gh/XilunWu/175/head -> 
origin/gh/XilunWu/175/head 2025-12-04T12:46:35.4793852Z * [new branch] gh/XilunWu/175/orig -> origin/gh/XilunWu/175/orig 2025-12-04T12:46:35.4794034Z * [new branch] gh/XilunWu/176/base -> origin/gh/XilunWu/176/base 2025-12-04T12:46:35.4794226Z * [new branch] gh/XilunWu/176/head -> origin/gh/XilunWu/176/head 2025-12-04T12:46:35.4794409Z * [new branch] gh/XilunWu/176/orig -> origin/gh/XilunWu/176/orig 2025-12-04T12:46:35.4794602Z * [new branch] gh/XuehaiPan/14/base -> origin/gh/XuehaiPan/14/base 2025-12-04T12:46:35.4794794Z * [new branch] gh/XuehaiPan/14/head -> origin/gh/XuehaiPan/14/head 2025-12-04T12:46:35.4794981Z * [new branch] gh/XuehaiPan/14/orig -> origin/gh/XuehaiPan/14/orig 2025-12-04T12:46:35.4795172Z * [new branch] gh/XuehaiPan/179/base -> origin/gh/XuehaiPan/179/base 2025-12-04T12:46:35.4795365Z * [new branch] gh/XuehaiPan/179/head -> origin/gh/XuehaiPan/179/head 2025-12-04T12:46:35.4795550Z * [new branch] gh/XuehaiPan/179/orig -> origin/gh/XuehaiPan/179/orig 2025-12-04T12:46:35.4795742Z * [new branch] gh/XuehaiPan/249/base -> origin/gh/XuehaiPan/249/base 2025-12-04T12:46:35.4795930Z * [new branch] gh/XuehaiPan/249/head -> origin/gh/XuehaiPan/249/head 2025-12-04T12:46:35.4796113Z * [new branch] gh/XuehaiPan/249/orig -> origin/gh/XuehaiPan/249/orig 2025-12-04T12:46:35.4796300Z * [new branch] gh/XuehaiPan/253/base -> origin/gh/XuehaiPan/253/base 2025-12-04T12:46:35.4796484Z * [new branch] gh/XuehaiPan/253/head -> origin/gh/XuehaiPan/253/head 2025-12-04T12:46:35.4796668Z * [new branch] gh/XuehaiPan/253/orig -> origin/gh/XuehaiPan/253/orig 2025-12-04T12:46:35.4796853Z * [new branch] gh/XuehaiPan/254/base -> origin/gh/XuehaiPan/254/base 2025-12-04T12:46:35.4797042Z * [new branch] gh/XuehaiPan/254/head -> origin/gh/XuehaiPan/254/head 2025-12-04T12:46:35.4797230Z * [new branch] gh/XuehaiPan/254/orig -> origin/gh/XuehaiPan/254/orig 2025-12-04T12:46:35.4797415Z * [new branch] gh/XuehaiPan/255/base -> origin/gh/XuehaiPan/255/base 2025-12-04T12:46:35.4797661Z * 
[new branch] gh/XuehaiPan/255/head -> origin/gh/XuehaiPan/255/head 2025-12-04T12:46:35.4797846Z * [new branch] gh/XuehaiPan/255/orig -> origin/gh/XuehaiPan/255/orig 2025-12-04T12:46:35.4798030Z * [new branch] gh/XuehaiPan/271/base -> origin/gh/XuehaiPan/271/base 2025-12-04T12:46:35.4798213Z * [new branch] gh/XuehaiPan/271/head -> origin/gh/XuehaiPan/271/head 2025-12-04T12:46:35.4798401Z * [new branch] gh/XuehaiPan/271/orig -> origin/gh/XuehaiPan/271/orig 2025-12-04T12:46:35.4798637Z * [new branch] gh/XuehaiPan/343/base -> origin/gh/XuehaiPan/343/base 2025-12-04T12:46:35.4798822Z * [new branch] gh/XuehaiPan/343/head -> origin/gh/XuehaiPan/343/head 2025-12-04T12:46:35.4799006Z * [new branch] gh/XuehaiPan/343/orig -> origin/gh/XuehaiPan/343/orig 2025-12-04T12:46:35.4799192Z * [new branch] gh/XuehaiPan/347/base -> origin/gh/XuehaiPan/347/base 2025-12-04T12:46:35.4799413Z * [new branch] gh/XuehaiPan/347/head -> origin/gh/XuehaiPan/347/head 2025-12-04T12:46:35.4799601Z * [new branch] gh/XuehaiPan/347/orig -> origin/gh/XuehaiPan/347/orig 2025-12-04T12:46:35.4799785Z * [new branch] gh/XuehaiPan/348/base -> origin/gh/XuehaiPan/348/base 2025-12-04T12:46:35.4799969Z * [new branch] gh/XuehaiPan/348/head -> origin/gh/XuehaiPan/348/head 2025-12-04T12:46:35.4800157Z * [new branch] gh/XuehaiPan/348/orig -> origin/gh/XuehaiPan/348/orig 2025-12-04T12:46:35.4800343Z * [new branch] gh/XuehaiPan/350/base -> origin/gh/XuehaiPan/350/base 2025-12-04T12:46:35.4800525Z * [new branch] gh/XuehaiPan/350/head -> origin/gh/XuehaiPan/350/head 2025-12-04T12:46:35.4800710Z * [new branch] gh/XuehaiPan/350/orig -> origin/gh/XuehaiPan/350/orig 2025-12-04T12:46:35.4800893Z * [new branch] gh/XuehaiPan/365/base -> origin/gh/XuehaiPan/365/base 2025-12-04T12:46:35.4801078Z * [new branch] gh/XuehaiPan/365/head -> origin/gh/XuehaiPan/365/head 2025-12-04T12:46:35.4801268Z * [new branch] gh/XuehaiPan/365/orig -> origin/gh/XuehaiPan/365/orig 2025-12-04T12:46:35.4801452Z * [new branch] gh/XuehaiPan/366/base -> 
origin/gh/XuehaiPan/366/base 2025-12-04T12:46:35.4801639Z * [new branch] gh/XuehaiPan/366/head -> origin/gh/XuehaiPan/366/head 2025-12-04T12:46:35.4801824Z * [new branch] gh/XuehaiPan/370/base -> origin/gh/XuehaiPan/370/base 2025-12-04T12:46:35.4802007Z * [new branch] gh/XuehaiPan/370/head -> origin/gh/XuehaiPan/370/head 2025-12-04T12:46:35.4802195Z * [new branch] gh/XuehaiPan/370/orig -> origin/gh/XuehaiPan/370/orig 2025-12-04T12:46:35.4802381Z * [new branch] gh/XuehaiPan/390/base -> origin/gh/XuehaiPan/390/base 2025-12-04T12:46:35.4802563Z * [new branch] gh/XuehaiPan/390/head -> origin/gh/XuehaiPan/390/head 2025-12-04T12:46:35.4802751Z * [new branch] gh/XuehaiPan/390/orig -> origin/gh/XuehaiPan/390/orig 2025-12-04T12:46:35.4802940Z * [new branch] gh/XuehaiPan/391/base -> origin/gh/XuehaiPan/391/base 2025-12-04T12:46:35.4803125Z * [new branch] gh/XuehaiPan/391/head -> origin/gh/XuehaiPan/391/head 2025-12-04T12:46:35.4803309Z * [new branch] gh/XuehaiPan/391/orig -> origin/gh/XuehaiPan/391/orig 2025-12-04T12:46:35.4803496Z * [new branch] gh/XuehaiPan/392/base -> origin/gh/XuehaiPan/392/base 2025-12-04T12:46:35.4803686Z * [new branch] gh/XuehaiPan/392/head -> origin/gh/XuehaiPan/392/head 2025-12-04T12:46:35.4803874Z * [new branch] gh/XuehaiPan/392/orig -> origin/gh/XuehaiPan/392/orig 2025-12-04T12:46:35.4804063Z * [new branch] gh/XuehaiPan/394/base -> origin/gh/XuehaiPan/394/base 2025-12-04T12:46:35.4804249Z * [new branch] gh/XuehaiPan/394/head -> origin/gh/XuehaiPan/394/head 2025-12-04T12:46:35.4804436Z * [new branch] gh/XuehaiPan/394/orig -> origin/gh/XuehaiPan/394/orig 2025-12-04T12:46:35.4804626Z * [new branch] gh/XuehaiPan/397/base -> origin/gh/XuehaiPan/397/base 2025-12-04T12:46:35.4804810Z * [new branch] gh/XuehaiPan/397/head -> origin/gh/XuehaiPan/397/head 2025-12-04T12:46:35.4804995Z * [new branch] gh/XuehaiPan/397/orig -> origin/gh/XuehaiPan/397/orig 2025-12-04T12:46:35.4805179Z * [new branch] gh/XuehaiPan/398/base -> origin/gh/XuehaiPan/398/base 
2025-12-04T12:46:35.4805390Z * [new branch] gh/XuehaiPan/398/head -> origin/gh/XuehaiPan/398/head 2025-12-04T12:46:35.4805582Z * [new branch] gh/XuehaiPan/398/orig -> origin/gh/XuehaiPan/398/orig 2025-12-04T12:46:35.4805766Z * [new branch] gh/XuehaiPan/399/base -> origin/gh/XuehaiPan/399/base 2025-12-04T12:46:35.4805973Z * [new branch] gh/XuehaiPan/399/head -> origin/gh/XuehaiPan/399/head 2025-12-04T12:46:35.4806157Z * [new branch] gh/XuehaiPan/399/orig -> origin/gh/XuehaiPan/399/orig 2025-12-04T12:46:35.4806340Z * [new branch] gh/XuehaiPan/400/base -> origin/gh/XuehaiPan/400/base 2025-12-04T12:46:35.4806526Z * [new branch] gh/XuehaiPan/400/head -> origin/gh/XuehaiPan/400/head 2025-12-04T12:46:35.4806713Z * [new branch] gh/XuehaiPan/400/orig -> origin/gh/XuehaiPan/400/orig 2025-12-04T12:46:35.4806903Z * [new branch] gh/ZhiweiYan-96/39/base -> origin/gh/ZhiweiYan-96/39/base 2025-12-04T12:46:35.4807102Z * [new branch] gh/ZhiweiYan-96/39/head -> origin/gh/ZhiweiYan-96/39/head 2025-12-04T12:46:35.4807295Z * [new branch] gh/ZhiweiYan-96/39/orig -> origin/gh/ZhiweiYan-96/39/orig 2025-12-04T12:46:35.4807530Z * [new branch] gh/ZhiweiYan-96/44/base -> origin/gh/ZhiweiYan-96/44/base 2025-12-04T12:46:35.4807720Z * [new branch] gh/ZhiweiYan-96/44/head -> origin/gh/ZhiweiYan-96/44/head 2025-12-04T12:46:35.4807906Z * [new branch] gh/ZhiweiYan-96/45/base -> origin/gh/ZhiweiYan-96/45/base 2025-12-04T12:46:35.4808091Z * [new branch] gh/ZhiweiYan-96/45/head -> origin/gh/ZhiweiYan-96/45/head 2025-12-04T12:46:35.4808277Z * [new branch] gh/ZhiweiYan-96/49/base -> origin/gh/ZhiweiYan-96/49/base 2025-12-04T12:46:35.4808463Z * [new branch] gh/ZhiweiYan-96/49/head -> origin/gh/ZhiweiYan-96/49/head 2025-12-04T12:46:35.4808652Z * [new branch] gh/ZhiweiYan-96/62/base -> origin/gh/ZhiweiYan-96/62/base 2025-12-04T12:46:35.4808841Z * [new branch] gh/ZhiweiYan-96/62/head -> origin/gh/ZhiweiYan-96/62/head 2025-12-04T12:46:35.4809033Z * [new branch] gh/ZhiweiYan-96/66/base -> 
origin/gh/ZhiweiYan-96/66/base 2025-12-04T12:46:35.4809220Z * [new branch] gh/ZhiweiYan-96/66/head -> origin/gh/ZhiweiYan-96/66/head 2025-12-04T12:46:35.4809412Z * [new branch] gh/ZhiweiYan-96/67/base -> origin/gh/ZhiweiYan-96/67/base 2025-12-04T12:46:35.4809602Z * [new branch] gh/ZhiweiYan-96/67/head -> origin/gh/ZhiweiYan-96/67/head 2025-12-04T12:46:35.4809787Z * [new branch] gh/ZhiweiYan-96/68/base -> origin/gh/ZhiweiYan-96/68/base 2025-12-04T12:46:35.4809975Z * [new branch] gh/ZhiweiYan-96/68/head -> origin/gh/ZhiweiYan-96/68/head 2025-12-04T12:46:35.4810163Z * [new branch] gh/ZhiweiYan-96/68/orig -> origin/gh/ZhiweiYan-96/68/orig 2025-12-04T12:46:35.4810353Z * [new branch] gh/aakhundov/1/base -> origin/gh/aakhundov/1/base 2025-12-04T12:46:35.4810540Z * [new branch] gh/aakhundov/1/head -> origin/gh/aakhundov/1/head 2025-12-04T12:46:35.4810722Z * [new branch] gh/aakhundov/2/base -> origin/gh/aakhundov/2/base 2025-12-04T12:46:35.4810903Z * [new branch] gh/aakhundov/2/head -> origin/gh/aakhundov/2/head 2025-12-04T12:46:35.4811088Z * [new branch] gh/aditew01/openblas -> origin/gh/aditew01/openblas 2025-12-04T12:46:35.4811274Z * [new branch] gh/aditew01/sbgemm -> origin/gh/aditew01/sbgemm 2025-12-04T12:46:35.4811456Z * [new branch] gh/aditew01/vecbf16 -> origin/gh/aditew01/vecbf16 2025-12-04T12:46:35.4811637Z * [new branch] gh/albanD/4/base -> origin/gh/albanD/4/base 2025-12-04T12:46:35.4811808Z * [new branch] gh/albanD/4/head -> origin/gh/albanD/4/head 2025-12-04T12:46:35.4812026Z * [new branch] gh/albanD/4/orig -> origin/gh/albanD/4/orig 2025-12-04T12:46:35.4812289Z * [new branch] gh/alexbrauckmann/paddedtensor_faketensor_init -> origin/gh/alexbrauckmann/paddedtensor_faketensor_init 2025-12-04T12:46:35.4812556Z * [new branch] gh/alexsamardzic/12/base -> origin/gh/alexsamardzic/12/base 2025-12-04T12:46:35.4812784Z * [new branch] gh/alexsamardzic/12/head -> origin/gh/alexsamardzic/12/head 2025-12-04T12:46:35.4812982Z * [new branch] gh/alexsamardzic/12/orig -> 
origin/gh/alexsamardzic/12/orig 2025-12-04T12:46:35.4813175Z * [new branch] gh/alexsamardzic/14/base -> origin/gh/alexsamardzic/14/base 2025-12-04T12:46:35.4813376Z * [new branch] gh/alexsamardzic/14/head -> origin/gh/alexsamardzic/14/head 2025-12-04T12:46:35.4813573Z * [new branch] gh/alexsamardzic/14/orig -> origin/gh/alexsamardzic/14/orig 2025-12-04T12:46:35.4813771Z * [new branch] gh/alexsamardzic/15/base -> origin/gh/alexsamardzic/15/base 2025-12-04T12:46:35.4813968Z * [new branch] gh/alexsamardzic/15/head -> origin/gh/alexsamardzic/15/head 2025-12-04T12:46:35.4814169Z * [new branch] gh/alexsamardzic/15/orig -> origin/gh/alexsamardzic/15/orig 2025-12-04T12:46:35.4814355Z * [new branch] gh/amjames/18/base -> origin/gh/amjames/18/base 2025-12-04T12:46:35.4814534Z * [new branch] gh/amjames/18/head -> origin/gh/amjames/18/head 2025-12-04T12:46:35.4814710Z * [new branch] gh/amjames/18/orig -> origin/gh/amjames/18/orig 2025-12-04T12:46:35.4814891Z * [new branch] gh/andrewor14/35/base -> origin/gh/andrewor14/35/base 2025-12-04T12:46:35.4815075Z * [new branch] gh/andrewor14/35/head -> origin/gh/andrewor14/35/head 2025-12-04T12:46:35.4815261Z * [new branch] gh/andrewor14/35/orig -> origin/gh/andrewor14/35/orig 2025-12-04T12:46:35.4815446Z * [new branch] gh/andrewor14/50/base -> origin/gh/andrewor14/50/base 2025-12-04T12:46:35.4815628Z * [new branch] gh/andrewor14/50/head -> origin/gh/andrewor14/50/head 2025-12-04T12:46:35.4815809Z * [new branch] gh/andrewor14/50/orig -> origin/gh/andrewor14/50/orig 2025-12-04T12:46:35.4815990Z * [new branch] gh/andyanwang/30/base -> origin/gh/andyanwang/30/base 2025-12-04T12:46:35.4816173Z * [new branch] gh/andyanwang/30/orig -> origin/gh/andyanwang/30/orig 2025-12-04T12:46:35.4816360Z * [new branch] gh/andyanwang/31/base -> origin/gh/andyanwang/31/base 2025-12-04T12:46:35.4816541Z * [new branch] gh/andyanwang/31/orig -> origin/gh/andyanwang/31/orig 2025-12-04T12:46:35.4816722Z * [new branch] gh/andyanwang/39/base -> 
origin/gh/andyanwang/39/base 2025-12-04T12:46:35.4816907Z * [new branch] gh/andyanwang/39/head -> origin/gh/andyanwang/39/head 2025-12-04T12:46:35.4817088Z * [new branch] gh/andyanwang/39/orig -> origin/gh/andyanwang/39/orig 2025-12-04T12:46:35.4817271Z * [new branch] gh/andyanwang/42/base -> origin/gh/andyanwang/42/base 2025-12-04T12:46:35.4817453Z * [new branch] gh/andyanwang/42/head -> origin/gh/andyanwang/42/head 2025-12-04T12:46:35.4817692Z * [new branch] gh/andyanwang/42/orig -> origin/gh/andyanwang/42/orig 2025-12-04T12:46:35.4817875Z * [new branch] gh/andyanwang/45/base -> origin/gh/andyanwang/45/base 2025-12-04T12:46:35.4818056Z * [new branch] gh/andyanwang/45/head -> origin/gh/andyanwang/45/head 2025-12-04T12:46:35.4818240Z * [new branch] gh/andyanwang/45/orig -> origin/gh/andyanwang/45/orig 2025-12-04T12:46:35.4818422Z * [new branch] gh/angelayi/107/base -> origin/gh/angelayi/107/base 2025-12-04T12:46:35.4818601Z * [new branch] gh/angelayi/107/head -> origin/gh/angelayi/107/head 2025-12-04T12:46:35.4818814Z * [new branch] gh/angelayi/114/base -> origin/gh/angelayi/114/base 2025-12-04T12:46:35.4818993Z * [new branch] gh/angelayi/114/head -> origin/gh/angelayi/114/head 2025-12-04T12:46:35.4819169Z * [new branch] gh/angelayi/114/orig -> origin/gh/angelayi/114/orig 2025-12-04T12:46:35.4819374Z * [new branch] gh/angelayi/116/base -> origin/gh/angelayi/116/base 2025-12-04T12:46:35.4819551Z * [new branch] gh/angelayi/116/head -> origin/gh/angelayi/116/head 2025-12-04T12:46:35.4819726Z * [new branch] gh/angelayi/116/orig -> origin/gh/angelayi/116/orig 2025-12-04T12:46:35.4819903Z * [new branch] gh/angelayi/122/base -> origin/gh/angelayi/122/base 2025-12-04T12:46:35.4820079Z * [new branch] gh/angelayi/122/head -> origin/gh/angelayi/122/head 2025-12-04T12:46:35.4820256Z * [new branch] gh/angelayi/122/orig -> origin/gh/angelayi/122/orig 2025-12-04T12:46:35.4820437Z * [new branch] gh/angelayi/124/base -> origin/gh/angelayi/124/base 2025-12-04T12:46:35.4820614Z * 
[new branch] gh/angelayi/124/head -> origin/gh/angelayi/124/head 2025-12-04T12:46:35.4820790Z * [new branch] gh/angelayi/124/orig -> origin/gh/angelayi/124/orig 2025-12-04T12:46:35.4820968Z * [new branch] gh/angelayi/128/base -> origin/gh/angelayi/128/base 2025-12-04T12:46:35.4821148Z * [new branch] gh/angelayi/128/head -> origin/gh/angelayi/128/head 2025-12-04T12:46:35.4821324Z * [new branch] gh/angelayi/128/orig -> origin/gh/angelayi/128/orig 2025-12-04T12:46:35.4821500Z * [new branch] gh/angelayi/131/base -> origin/gh/angelayi/131/base 2025-12-04T12:46:35.4821677Z * [new branch] gh/angelayi/131/head -> origin/gh/angelayi/131/head 2025-12-04T12:46:35.4821857Z * [new branch] gh/angelayi/131/orig -> origin/gh/angelayi/131/orig 2025-12-04T12:46:35.4822037Z * [new branch] gh/angelayi/132/base -> origin/gh/angelayi/132/base 2025-12-04T12:46:35.4822214Z * [new branch] gh/angelayi/132/head -> origin/gh/angelayi/132/head 2025-12-04T12:46:35.4822390Z * [new branch] gh/angelayi/132/orig -> origin/gh/angelayi/132/orig 2025-12-04T12:46:35.4822569Z * [new branch] gh/angelayi/133/base -> origin/gh/angelayi/133/base 2025-12-04T12:46:35.4822748Z * [new branch] gh/angelayi/133/head -> origin/gh/angelayi/133/head 2025-12-04T12:46:35.4822925Z * [new branch] gh/angelayi/133/orig -> origin/gh/angelayi/133/orig 2025-12-04T12:46:35.4823103Z * [new branch] gh/angelayi/134/base -> origin/gh/angelayi/134/base 2025-12-04T12:46:35.4823277Z * [new branch] gh/angelayi/134/head -> origin/gh/angelayi/134/head 2025-12-04T12:46:35.4823455Z * [new branch] gh/angelayi/134/orig -> origin/gh/angelayi/134/orig 2025-12-04T12:46:35.4823633Z * [new branch] gh/angelayi/135/base -> origin/gh/angelayi/135/base 2025-12-04T12:46:35.4823809Z * [new branch] gh/angelayi/135/head -> origin/gh/angelayi/135/head 2025-12-04T12:46:35.4823991Z * [new branch] gh/angelayi/135/orig -> origin/gh/angelayi/135/orig 2025-12-04T12:46:35.4824170Z * [new branch] gh/angelayi/136/base -> origin/gh/angelayi/136/base 
2025-12-04T12:46:35.4824345Z * [new branch] gh/angelayi/136/head -> origin/gh/angelayi/136/head 2025-12-04T12:46:35.4824522Z * [new branch] gh/angelayi/136/orig -> origin/gh/angelayi/136/orig 2025-12-04T12:46:35.4824698Z * [new branch] gh/angelayi/137/base -> origin/gh/angelayi/137/base 2025-12-04T12:46:35.4824874Z * [new branch] gh/angelayi/137/head -> origin/gh/angelayi/137/head 2025-12-04T12:46:35.4825080Z * [new branch] gh/angelayi/137/orig -> origin/gh/angelayi/137/orig 2025-12-04T12:46:35.4825259Z * [new branch] gh/angelayi/138/base -> origin/gh/angelayi/138/base 2025-12-04T12:46:35.4825433Z * [new branch] gh/angelayi/138/head -> origin/gh/angelayi/138/head 2025-12-04T12:46:35.4825615Z * [new branch] gh/angelayi/138/orig -> origin/gh/angelayi/138/orig 2025-12-04T12:46:35.4825818Z * [new branch] gh/angelayi/139/base -> origin/gh/angelayi/139/base 2025-12-04T12:46:35.4825995Z * [new branch] gh/angelayi/139/head -> origin/gh/angelayi/139/head 2025-12-04T12:46:35.4826171Z * [new branch] gh/angelayi/139/orig -> origin/gh/angelayi/139/orig 2025-12-04T12:46:35.4826347Z * [new branch] gh/angelayi/140/base -> origin/gh/angelayi/140/base 2025-12-04T12:46:35.4826525Z * [new branch] gh/angelayi/140/head -> origin/gh/angelayi/140/head 2025-12-04T12:46:35.4826705Z * [new branch] gh/angelayi/140/orig -> origin/gh/angelayi/140/orig 2025-12-04T12:46:35.4826881Z * [new branch] gh/angelayi/141/base -> origin/gh/angelayi/141/base 2025-12-04T12:46:35.4827058Z * [new branch] gh/angelayi/141/head -> origin/gh/angelayi/141/head 2025-12-04T12:46:35.4827235Z * [new branch] gh/angelayi/141/orig -> origin/gh/angelayi/141/orig 2025-12-04T12:46:35.4827411Z * [new branch] gh/angelayi/142/base -> origin/gh/angelayi/142/base 2025-12-04T12:46:35.4827635Z * [new branch] gh/angelayi/142/head -> origin/gh/angelayi/142/head 2025-12-04T12:46:35.4827811Z * [new branch] gh/angelayi/142/orig -> origin/gh/angelayi/142/orig 2025-12-04T12:46:35.4827987Z * [new branch] gh/angelayi/143/base -> 
origin/gh/angelayi/143/base 2025-12-04T12:46:35.4828166Z * [new branch] gh/angelayi/143/head -> origin/gh/angelayi/143/head 2025-12-04T12:46:35.4828405Z * [new branch] gh/angelayi/143/orig -> origin/gh/angelayi/143/orig 2025-12-04T12:46:35.4828581Z * [new branch] gh/angelayi/144/base -> origin/gh/angelayi/144/base 2025-12-04T12:46:35.4828758Z * [new branch] gh/angelayi/144/head -> origin/gh/angelayi/144/head 2025-12-04T12:46:35.4828935Z * [new branch] gh/angelayi/144/orig -> origin/gh/angelayi/144/orig 2025-12-04T12:46:35.4829121Z * [new branch] gh/anijain2305/753/base -> origin/gh/anijain2305/753/base 2025-12-04T12:46:35.4829311Z * [new branch] gh/anijain2305/753/head -> origin/gh/anijain2305/753/head 2025-12-04T12:46:35.4829496Z * [new branch] gh/anijain2305/753/orig -> origin/gh/anijain2305/753/orig 2025-12-04T12:46:35.4829680Z * [new branch] gh/anijain2305/810/base -> origin/gh/anijain2305/810/base 2025-12-04T12:46:35.4829864Z * [new branch] gh/anijain2305/810/head -> origin/gh/anijain2305/810/head 2025-12-04T12:46:35.4830050Z * [new branch] gh/anijain2305/810/orig -> origin/gh/anijain2305/810/orig 2025-12-04T12:46:35.4830234Z * [new branch] gh/anijain2305/854/base -> origin/gh/anijain2305/854/base 2025-12-04T12:46:35.4830417Z * [new branch] gh/anijain2305/854/head -> origin/gh/anijain2305/854/head 2025-12-04T12:46:35.4830599Z * [new branch] gh/anijain2305/854/orig -> origin/gh/anijain2305/854/orig 2025-12-04T12:46:35.4830780Z * [new branch] gh/anijain2305/864/base -> origin/gh/anijain2305/864/base 2025-12-04T12:46:35.4830964Z * [new branch] gh/anijain2305/864/head -> origin/gh/anijain2305/864/head 2025-12-04T12:46:35.4831149Z * [new branch] gh/anijain2305/864/orig -> origin/gh/anijain2305/864/orig 2025-12-04T12:46:35.4831333Z * [new branch] gh/anijain2305/870/base -> origin/gh/anijain2305/870/base 2025-12-04T12:46:35.4831516Z * [new branch] gh/anijain2305/870/head -> origin/gh/anijain2305/870/head 2025-12-04T12:46:35.4831738Z * [new branch] 
gh/anijain2305/870/orig -> origin/gh/anijain2305/870/orig 2025-12-04T12:46:35.4831922Z * [new branch] gh/anijain2305/873/base -> origin/gh/anijain2305/873/base 2025-12-04T12:46:35.4832106Z * [new branch] gh/anijain2305/873/head -> origin/gh/anijain2305/873/head 2025-12-04T12:46:35.4832329Z * [new branch] gh/anijain2305/873/orig -> origin/gh/anijain2305/873/orig 2025-12-04T12:46:35.4832512Z * [new branch] gh/anijain2305/894/base -> origin/gh/anijain2305/894/base 2025-12-04T12:46:35.4832695Z * [new branch] gh/anijain2305/894/head -> origin/gh/anijain2305/894/head 2025-12-04T12:46:35.4832877Z * [new branch] gh/anijain2305/894/orig -> origin/gh/anijain2305/894/orig 2025-12-04T12:46:35.4833061Z * [new branch] gh/anijain2305/895/base -> origin/gh/anijain2305/895/base 2025-12-04T12:46:35.4833249Z * [new branch] gh/anijain2305/895/head -> origin/gh/anijain2305/895/head 2025-12-04T12:46:35.4833431Z * [new branch] gh/anijain2305/895/orig -> origin/gh/anijain2305/895/orig 2025-12-04T12:46:35.4833613Z * [new branch] gh/anijain2305/910/base -> origin/gh/anijain2305/910/base 2025-12-04T12:46:35.4833794Z * [new branch] gh/anijain2305/910/head -> origin/gh/anijain2305/910/head 2025-12-04T12:46:35.4833979Z * [new branch] gh/anijain2305/910/orig -> origin/gh/anijain2305/910/orig 2025-12-04T12:46:35.4834165Z * [new branch] gh/anijain2305/919/base -> origin/gh/anijain2305/919/base 2025-12-04T12:46:35.4834347Z * [new branch] gh/anijain2305/919/head -> origin/gh/anijain2305/919/head 2025-12-04T12:46:35.4834530Z * [new branch] gh/anijain2305/919/orig -> origin/gh/anijain2305/919/orig 2025-12-04T12:46:35.4834711Z * [new branch] gh/anijain2305/922/base -> origin/gh/anijain2305/922/base 2025-12-04T12:46:35.4834894Z * [new branch] gh/anijain2305/922/head -> origin/gh/anijain2305/922/head 2025-12-04T12:46:35.4835081Z * [new branch] gh/anijain2305/922/orig -> origin/gh/anijain2305/922/orig 2025-12-04T12:46:35.4835263Z * [new branch] gh/anijain2305/932/base -> origin/gh/anijain2305/932/base 
2025-12-04T12:46:35.4835446Z * [new branch] gh/anijain2305/932/head -> origin/gh/anijain2305/932/head 2025-12-04T12:46:35.4835628Z * [new branch] gh/anijain2305/932/orig -> origin/gh/anijain2305/932/orig 2025-12-04T12:46:35.4835814Z * [new branch] gh/anijain2305/940/base -> origin/gh/anijain2305/940/base 2025-12-04T12:46:35.4835996Z * [new branch] gh/anijain2305/940/head -> origin/gh/anijain2305/940/head 2025-12-04T12:46:35.4836178Z * [new branch] gh/anijain2305/940/orig -> origin/gh/anijain2305/940/orig 2025-12-04T12:46:35.4836360Z * [new branch] gh/anijain2305/941/base -> origin/gh/anijain2305/941/base 2025-12-04T12:46:35.4836543Z * [new branch] gh/anijain2305/941/head -> origin/gh/anijain2305/941/head 2025-12-04T12:46:35.4836729Z * [new branch] gh/anijain2305/941/orig -> origin/gh/anijain2305/941/orig 2025-12-04T12:46:35.4836911Z * [new branch] gh/anijain2305/942/base -> origin/gh/anijain2305/942/base 2025-12-04T12:46:35.4837094Z * [new branch] gh/anijain2305/942/head -> origin/gh/anijain2305/942/head 2025-12-04T12:46:35.4837276Z * [new branch] gh/anijain2305/942/orig -> origin/gh/anijain2305/942/orig 2025-12-04T12:46:35.4837458Z * [new branch] gh/anijain2305/943/base -> origin/gh/anijain2305/943/base 2025-12-04T12:46:35.4837682Z * [new branch] gh/anijain2305/943/head -> origin/gh/anijain2305/943/head 2025-12-04T12:46:35.4837865Z * [new branch] gh/anijain2305/943/orig -> origin/gh/anijain2305/943/orig 2025-12-04T12:46:35.4838105Z * [new branch] gh/anijain2305/944/base -> origin/gh/anijain2305/944/base 2025-12-04T12:46:35.4838287Z * [new branch] gh/anijain2305/944/head -> origin/gh/anijain2305/944/head 2025-12-04T12:46:35.4838469Z * [new branch] gh/anijain2305/944/orig -> origin/gh/anijain2305/944/orig 2025-12-04T12:46:35.4838651Z * [new branch] gh/anijain2305/945/base -> origin/gh/anijain2305/945/base 2025-12-04T12:46:35.4838877Z * [new branch] gh/anijain2305/945/head -> origin/gh/anijain2305/945/head 2025-12-04T12:46:35.4839064Z * [new branch] 
gh/anijain2305/945/orig -> origin/gh/anijain2305/945/orig 2025-12-04T12:46:35.4839246Z * [new branch] gh/anijain2305/946/base -> origin/gh/anijain2305/946/base 2025-12-04T12:46:35.4839430Z * [new branch] gh/anijain2305/946/head -> origin/gh/anijain2305/946/head 2025-12-04T12:46:35.4839615Z * [new branch] gh/anijain2305/946/orig -> origin/gh/anijain2305/946/orig 2025-12-04T12:46:35.4839796Z * [new branch] gh/anijain2305/947/base -> origin/gh/anijain2305/947/base 2025-12-04T12:46:35.4839978Z * [new branch] gh/anijain2305/947/head -> origin/gh/anijain2305/947/head 2025-12-04T12:46:35.4840164Z * [new branch] gh/anijain2305/947/orig -> origin/gh/anijain2305/947/orig 2025-12-04T12:46:35.4840348Z * [new branch] gh/anijain2305/948/base -> origin/gh/anijain2305/948/base 2025-12-04T12:46:35.4840531Z * [new branch] gh/anijain2305/948/head -> origin/gh/anijain2305/948/head 2025-12-04T12:46:35.4840712Z * [new branch] gh/anijain2305/948/orig -> origin/gh/anijain2305/948/orig 2025-12-04T12:46:35.4840893Z * [new branch] gh/anijain2305/949/base -> origin/gh/anijain2305/949/base 2025-12-04T12:46:35.4841077Z * [new branch] gh/anijain2305/949/head -> origin/gh/anijain2305/949/head 2025-12-04T12:46:35.4841260Z * [new branch] gh/anijain2305/949/orig -> origin/gh/anijain2305/949/orig 2025-12-04T12:46:35.4841442Z * [new branch] gh/anijain2305/950/base -> origin/gh/anijain2305/950/base 2025-12-04T12:46:35.4841624Z * [new branch] gh/anijain2305/950/head -> origin/gh/anijain2305/950/head 2025-12-04T12:46:35.4841805Z * [new branch] gh/anijain2305/950/orig -> origin/gh/anijain2305/950/orig 2025-12-04T12:46:35.4841992Z * [new branch] gh/anijain2305/951/base -> origin/gh/anijain2305/951/base 2025-12-04T12:46:35.4842173Z * [new branch] gh/anijain2305/951/head -> origin/gh/anijain2305/951/head 2025-12-04T12:46:35.4842355Z * [new branch] gh/anijain2305/951/orig -> origin/gh/anijain2305/951/orig 2025-12-04T12:46:35.4842535Z * [new branch] gh/anijain2305/952/base -> origin/gh/anijain2305/952/base 
2025-12-04T12:46:35.4842718Z * [new branch] gh/anijain2305/952/head -> origin/gh/anijain2305/952/head 2025-12-04T12:46:35.4842903Z * [new branch] gh/anijain2305/952/orig -> origin/gh/anijain2305/952/orig 2025-12-04T12:46:35.4843084Z * [new branch] gh/anijain2305/953/base -> origin/gh/anijain2305/953/base 2025-12-04T12:46:35.4843266Z * [new branch] gh/anijain2305/953/head -> origin/gh/anijain2305/953/head 2025-12-04T12:46:35.4843448Z * [new branch] gh/anijain2305/953/orig -> origin/gh/anijain2305/953/orig 2025-12-04T12:46:35.4843634Z * [new branch] gh/anijain2305/954/base -> origin/gh/anijain2305/954/base 2025-12-04T12:46:35.4843816Z * [new branch] gh/anijain2305/954/head -> origin/gh/anijain2305/954/head 2025-12-04T12:46:35.4843998Z * [new branch] gh/anijain2305/954/orig -> origin/gh/anijain2305/954/orig 2025-12-04T12:46:35.4844183Z * [new branch] gh/anijain2305/955/base -> origin/gh/anijain2305/955/base 2025-12-04T12:46:35.4844365Z * [new branch] gh/anijain2305/955/head -> origin/gh/anijain2305/955/head 2025-12-04T12:46:35.4844581Z * [new branch] gh/anijain2305/955/orig -> origin/gh/anijain2305/955/orig 2025-12-04T12:46:35.4844766Z * [new branch] gh/anijain2305/956/base -> origin/gh/anijain2305/956/base 2025-12-04T12:46:35.4844948Z * [new branch] gh/anijain2305/956/head -> origin/gh/anijain2305/956/head 2025-12-04T12:46:35.4845152Z * [new branch] gh/anijain2305/956/orig -> origin/gh/anijain2305/956/orig 2025-12-04T12:46:35.4845334Z * [new branch] gh/anijain2305/957/base -> origin/gh/anijain2305/957/base 2025-12-04T12:46:35.4845518Z * [new branch] gh/anijain2305/957/head -> origin/gh/anijain2305/957/head 2025-12-04T12:46:35.4845700Z * [new branch] gh/anijain2305/957/orig -> origin/gh/anijain2305/957/orig 2025-12-04T12:46:35.4845883Z * [new branch] gh/anijain2305/958/base -> origin/gh/anijain2305/958/base 2025-12-04T12:46:35.4846067Z * [new branch] gh/anijain2305/958/head -> origin/gh/anijain2305/958/head 2025-12-04T12:46:35.4846249Z * [new branch] 
gh/anijain2305/958/orig -> origin/gh/anijain2305/958/orig 2025-12-04T12:46:35.4846432Z * [new branch] gh/anijain2305/959/base -> origin/gh/anijain2305/959/base 2025-12-04T12:46:35.4846616Z * [new branch] gh/anijain2305/959/head -> origin/gh/anijain2305/959/head 2025-12-04T12:46:35.4846800Z * [new branch] gh/anijain2305/959/orig -> origin/gh/anijain2305/959/orig 2025-12-04T12:46:35.4846984Z * [new branch] gh/anijain2305/960/base -> origin/gh/anijain2305/960/base 2025-12-04T12:46:35.4847167Z * [new branch] gh/anijain2305/960/head -> origin/gh/anijain2305/960/head 2025-12-04T12:46:35.4847348Z * [new branch] gh/anijain2305/960/orig -> origin/gh/anijain2305/960/orig 2025-12-04T12:46:35.4847568Z * [new branch] gh/anijain2305/961/base -> origin/gh/anijain2305/961/base 2025-12-04T12:46:35.4847755Z * [new branch] gh/anijain2305/961/head -> origin/gh/anijain2305/961/head 2025-12-04T12:46:35.4847937Z * [new branch] gh/anijain2305/961/orig -> origin/gh/anijain2305/961/orig 2025-12-04T12:46:35.4848121Z * [new branch] gh/anijain2305/962/base -> origin/gh/anijain2305/962/base 2025-12-04T12:46:35.4848302Z * [new branch] gh/anijain2305/962/head -> origin/gh/anijain2305/962/head 2025-12-04T12:46:35.4848488Z * [new branch] gh/anijain2305/962/orig -> origin/gh/anijain2305/962/orig 2025-12-04T12:46:35.4848673Z * [new branch] gh/anijain2305/963/base -> origin/gh/anijain2305/963/base 2025-12-04T12:46:35.4848854Z * [new branch] gh/anijain2305/963/head -> origin/gh/anijain2305/963/head 2025-12-04T12:46:35.4849036Z * [new branch] gh/anijain2305/963/orig -> origin/gh/anijain2305/963/orig 2025-12-04T12:46:35.4849219Z * [new branch] gh/anijain2305/964/base -> origin/gh/anijain2305/964/base 2025-12-04T12:46:35.4849402Z * [new branch] gh/anijain2305/964/head -> origin/gh/anijain2305/964/head 2025-12-04T12:46:35.4849588Z * [new branch] gh/anijain2305/964/orig -> origin/gh/anijain2305/964/orig 2025-12-04T12:46:35.4849772Z * [new branch] gh/anijain2305/965/base -> origin/gh/anijain2305/965/base 
2025-12-04T12:46:35.4849954Z * [new branch] gh/anijain2305/965/head -> origin/gh/anijain2305/965/head 2025-12-04T12:46:35.4850138Z * [new branch] gh/anijain2305/965/orig -> origin/gh/anijain2305/965/orig 2025-12-04T12:46:35.4850320Z * [new branch] gh/anijain2305/966/base -> origin/gh/anijain2305/966/base 2025-12-04T12:46:35.4850503Z * [new branch] gh/anijain2305/966/head -> origin/gh/anijain2305/966/head 2025-12-04T12:46:35.4850689Z * [new branch] gh/anijain2305/966/orig -> origin/gh/anijain2305/966/orig 2025-12-04T12:46:35.4850871Z * [new branch] gh/anijain2305/967/base -> origin/gh/anijain2305/967/base 2025-12-04T12:46:35.4851096Z * [new branch] gh/anijain2305/967/head -> origin/gh/anijain2305/967/head 2025-12-04T12:46:35.4851279Z * [new branch] gh/anijain2305/967/orig -> origin/gh/anijain2305/967/orig 2025-12-04T12:46:35.4851465Z * [new branch] gh/anijain2305/968/base -> origin/gh/anijain2305/968/base 2025-12-04T12:46:35.4851690Z * [new branch] gh/anijain2305/968/head -> origin/gh/anijain2305/968/head 2025-12-04T12:46:35.4851878Z * [new branch] gh/anijain2305/968/orig -> origin/gh/anijain2305/968/orig 2025-12-04T12:46:35.4852062Z * [new branch] gh/anijain2305/969/base -> origin/gh/anijain2305/969/base 2025-12-04T12:46:35.4852247Z * [new branch] gh/anijain2305/969/head -> origin/gh/anijain2305/969/head 2025-12-04T12:46:35.4852431Z * [new branch] gh/anijain2305/969/orig -> origin/gh/anijain2305/969/orig 2025-12-04T12:46:35.4852615Z * [new branch] gh/anijain2305/970/base -> origin/gh/anijain2305/970/base 2025-12-04T12:46:35.4852798Z * [new branch] gh/anijain2305/970/head -> origin/gh/anijain2305/970/head 2025-12-04T12:46:35.4852981Z * [new branch] gh/anijain2305/970/orig -> origin/gh/anijain2305/970/orig 2025-12-04T12:46:35.4853164Z * [new branch] gh/anjali411/216/base -> origin/gh/anjali411/216/base 2025-12-04T12:46:35.4853350Z * [new branch] gh/anjali411/216/head -> origin/gh/anjali411/216/head 2025-12-04T12:46:35.4853531Z * [new branch] gh/anjali411/216/orig -> 
origin/gh/anjali411/216/orig 2025-12-04T12:46:35.4853711Z * [new branch] gh/anshul-si/1/base -> origin/gh/anshul-si/1/base 2025-12-04T12:46:35.4853892Z * [new branch] gh/anshul-si/1/head -> origin/gh/anshul-si/1/head 2025-12-04T12:46:35.4854068Z * [new branch] gh/anshul-si/2/base -> origin/gh/anshul-si/2/base 2025-12-04T12:46:35.4854244Z * [new branch] gh/anshul-si/2/head -> origin/gh/anshul-si/2/head 2025-12-04T12:46:35.4854420Z * [new branch] gh/anshul-si/3/base -> origin/gh/anshul-si/3/base 2025-12-04T12:46:35.4854597Z * [new branch] gh/anshul-si/3/head -> origin/gh/anshul-si/3/head 2025-12-04T12:46:35.4854772Z * [new branch] gh/anshul-si/4/base -> origin/gh/anshul-si/4/base 2025-12-04T12:46:35.4854950Z * [new branch] gh/anshul-si/4/head -> origin/gh/anshul-si/4/head 2025-12-04T12:46:35.4855124Z * [new branch] gh/anshul-si/5/base -> origin/gh/anshul-si/5/base 2025-12-04T12:46:35.4855296Z * [new branch] gh/anshul-si/5/head -> origin/gh/anshul-si/5/head 2025-12-04T12:46:35.4855478Z * [new branch] gh/anshul-si/53/base -> origin/gh/anshul-si/53/base 2025-12-04T12:46:35.4855656Z * [new branch] gh/anshul-si/53/head -> origin/gh/anshul-si/53/head 2025-12-04T12:46:35.4855833Z * [new branch] gh/anshul-si/58/base -> origin/gh/anshul-si/58/base 2025-12-04T12:46:35.4856010Z * [new branch] gh/anshul-si/58/head -> origin/gh/anshul-si/58/head 2025-12-04T12:46:35.4856187Z * [new branch] gh/anshul-si/66/base -> origin/gh/anshul-si/66/base 2025-12-04T12:46:35.4856361Z * [new branch] gh/anshul-si/66/head -> origin/gh/anshul-si/66/head 2025-12-04T12:46:35.4856541Z * [new branch] gh/anshul-si/66/orig -> origin/gh/anshul-si/66/orig 2025-12-04T12:46:35.4856714Z * [new branch] gh/anshul-si/67/base -> origin/gh/anshul-si/67/base 2025-12-04T12:46:35.4856891Z * [new branch] gh/anshul-si/67/head -> origin/gh/anshul-si/67/head 2025-12-04T12:46:35.4857067Z * [new branch] gh/anshul-si/67/orig -> origin/gh/anshul-si/67/orig 2025-12-04T12:46:35.4857241Z * [new branch] gh/anshul-si/68/base -> 
origin/gh/anshul-si/68/base 2025-12-04T12:46:35.4857443Z * [new branch] gh/anshul-si/68/head -> origin/gh/anshul-si/68/head 2025-12-04T12:46:35.4857675Z * [new branch] gh/anshul-si/68/orig -> origin/gh/anshul-si/68/orig 2025-12-04T12:46:35.4857850Z * [new branch] gh/anshul-si/69/base -> origin/gh/anshul-si/69/base 2025-12-04T12:46:35.4858027Z * [new branch] gh/anshul-si/69/head -> origin/gh/anshul-si/69/head 2025-12-04T12:46:35.4858259Z * [new branch] gh/anshul-si/69/orig -> origin/gh/anshul-si/69/orig 2025-12-04T12:46:35.4858433Z * [new branch] gh/anshul-si/70/base -> origin/gh/anshul-si/70/base 2025-12-04T12:46:35.4858612Z * [new branch] gh/anshul-si/70/head -> origin/gh/anshul-si/70/head 2025-12-04T12:46:35.4858790Z * [new branch] gh/anshul-si/70/orig -> origin/gh/anshul-si/70/orig 2025-12-04T12:46:35.4858965Z * [new branch] gh/anshul-si/71/base -> origin/gh/anshul-si/71/base 2025-12-04T12:46:35.4859145Z * [new branch] gh/anshul-si/71/head -> origin/gh/anshul-si/71/head 2025-12-04T12:46:35.4859320Z * [new branch] gh/anshul-si/71/orig -> origin/gh/anshul-si/71/orig 2025-12-04T12:46:35.4859495Z * [new branch] gh/anshul-si/72/base -> origin/gh/anshul-si/72/base 2025-12-04T12:46:35.4859671Z * [new branch] gh/anshul-si/72/head -> origin/gh/anshul-si/72/head 2025-12-04T12:46:35.4859850Z * [new branch] gh/anshul-si/72/orig -> origin/gh/anshul-si/72/orig 2025-12-04T12:46:35.4860024Z * [new branch] gh/anshul-si/73/base -> origin/gh/anshul-si/73/base 2025-12-04T12:46:35.4860200Z * [new branch] gh/anshul-si/73/head -> origin/gh/anshul-si/73/head 2025-12-04T12:46:35.4860375Z * [new branch] gh/anshul-si/73/orig -> origin/gh/anshul-si/73/orig 2025-12-04T12:46:35.4860553Z * [new branch] gh/aorenste/132/base -> origin/gh/aorenste/132/base 2025-12-04T12:46:35.4860736Z * [new branch] gh/aorenste/132/head -> origin/gh/aorenste/132/head 2025-12-04T12:46:35.4860912Z * [new branch] gh/aorenste/134/base -> origin/gh/aorenste/134/base 2025-12-04T12:46:35.4861090Z * [new branch] 
gh/aorenste/134/head -> origin/gh/aorenste/134/head 2025-12-04T12:46:35.4861269Z * [new branch] gh/aorenste/134/orig -> origin/gh/aorenste/134/orig 2025-12-04T12:46:35.4861446Z * [new branch] gh/aorenste/139/base -> origin/gh/aorenste/139/base 2025-12-04T12:46:35.4861627Z * [new branch] gh/aorenste/139/head -> origin/gh/aorenste/139/head 2025-12-04T12:46:35.4861805Z * [new branch] gh/aorenste/139/orig -> origin/gh/aorenste/139/orig 2025-12-04T12:46:35.4861982Z * [new branch] gh/aorenste/141/base -> origin/gh/aorenste/141/base 2025-12-04T12:46:35.4862163Z * [new branch] gh/aorenste/141/head -> origin/gh/aorenste/141/head 2025-12-04T12:46:35.4862346Z * [new branch] gh/aorenste/145/base -> origin/gh/aorenste/145/base 2025-12-04T12:46:35.4862524Z * [new branch] gh/aorenste/145/head -> origin/gh/aorenste/145/head 2025-12-04T12:46:35.4862704Z * [new branch] gh/aorenste/145/orig -> origin/gh/aorenste/145/orig 2025-12-04T12:46:35.4862886Z * [new branch] gh/aorenste/146/base -> origin/gh/aorenste/146/base 2025-12-04T12:46:35.4863062Z * [new branch] gh/aorenste/146/head -> origin/gh/aorenste/146/head 2025-12-04T12:46:35.4863242Z * [new branch] gh/aorenste/146/orig -> origin/gh/aorenste/146/orig 2025-12-04T12:46:35.4863421Z * [new branch] gh/aorenste/147/base -> origin/gh/aorenste/147/base 2025-12-04T12:46:35.4863597Z * [new branch] gh/aorenste/147/head -> origin/gh/aorenste/147/head 2025-12-04T12:46:35.4863774Z * [new branch] gh/aorenste/147/orig -> origin/gh/aorenste/147/orig 2025-12-04T12:46:35.4863992Z * [new branch] gh/aorenste/148/base -> origin/gh/aorenste/148/base 2025-12-04T12:46:35.4864170Z * [new branch] gh/aorenste/148/head -> origin/gh/aorenste/148/head 2025-12-04T12:46:35.4864349Z * [new branch] gh/aorenste/148/orig -> origin/gh/aorenste/148/orig 2025-12-04T12:46:35.4864550Z * [new branch] gh/aorenste/149/base -> origin/gh/aorenste/149/base 2025-12-04T12:46:35.4864730Z * [new branch] gh/aorenste/149/head -> origin/gh/aorenste/149/head 
2025-12-04T12:46:35.4864907Z * [new branch] gh/aorenste/149/orig -> origin/gh/aorenste/149/orig 2025-12-04T12:46:35.4865083Z * [new branch] gh/aorenste/150/base -> origin/gh/aorenste/150/base 2025-12-04T12:46:35.4865263Z * [new branch] gh/aorenste/150/head -> origin/gh/aorenste/150/head 2025-12-04T12:46:35.4865441Z * [new branch] gh/aorenste/150/orig -> origin/gh/aorenste/150/orig 2025-12-04T12:46:35.4865617Z * [new branch] gh/aorenste/151/base -> origin/gh/aorenste/151/base 2025-12-04T12:46:35.4865797Z * [new branch] gh/aorenste/151/head -> origin/gh/aorenste/151/head 2025-12-04T12:46:35.4865976Z * [new branch] gh/aorenste/151/orig -> origin/gh/aorenste/151/orig 2025-12-04T12:46:35.4866154Z * [new branch] gh/aorenste/152/base -> origin/gh/aorenste/152/base 2025-12-04T12:46:35.4866333Z * [new branch] gh/aorenste/152/head -> origin/gh/aorenste/152/head 2025-12-04T12:46:35.4866513Z * [new branch] gh/aorenste/152/orig -> origin/gh/aorenste/152/orig 2025-12-04T12:46:35.4866689Z * [new branch] gh/aorenste/153/base -> origin/gh/aorenste/153/base 2025-12-04T12:46:35.4866867Z * [new branch] gh/aorenste/153/head -> origin/gh/aorenste/153/head 2025-12-04T12:46:35.4867045Z * [new branch] gh/aorenste/153/orig -> origin/gh/aorenste/153/orig 2025-12-04T12:46:35.4867223Z * [new branch] gh/aorenste/154/base -> origin/gh/aorenste/154/base 2025-12-04T12:46:35.4867401Z * [new branch] gh/aorenste/154/head -> origin/gh/aorenste/154/head 2025-12-04T12:46:35.4867612Z * [new branch] gh/aorenste/154/orig -> origin/gh/aorenste/154/orig 2025-12-04T12:46:35.4867790Z * [new branch] gh/aorenste/155/base -> origin/gh/aorenste/155/base 2025-12-04T12:46:35.4867970Z * [new branch] gh/aorenste/155/head -> origin/gh/aorenste/155/head 2025-12-04T12:46:35.4868148Z * [new branch] gh/aorenste/155/orig -> origin/gh/aorenste/155/orig 2025-12-04T12:46:35.4868324Z * [new branch] gh/aorenste/156/base -> origin/gh/aorenste/156/base 2025-12-04T12:46:35.4868501Z * [new branch] gh/aorenste/156/head -> 
origin/gh/aorenste/156/head 2025-12-04T12:46:35.4868680Z * [new branch] gh/aorenste/156/orig -> origin/gh/aorenste/156/orig 2025-12-04T12:46:35.4868858Z * [new branch] gh/aorenste/157/base -> origin/gh/aorenste/157/base 2025-12-04T12:46:35.4869035Z * [new branch] gh/aorenste/157/head -> origin/gh/aorenste/157/head 2025-12-04T12:46:35.4869211Z * [new branch] gh/aorenste/157/orig -> origin/gh/aorenste/157/orig 2025-12-04T12:46:35.4869393Z * [new branch] gh/aorenste/158/base -> origin/gh/aorenste/158/base 2025-12-04T12:46:35.4869574Z * [new branch] gh/aorenste/158/head -> origin/gh/aorenste/158/head 2025-12-04T12:46:35.4869751Z * [new branch] gh/aorenste/158/orig -> origin/gh/aorenste/158/orig 2025-12-04T12:46:35.4869929Z * [new branch] gh/aorenste/159/base -> origin/gh/aorenste/159/base 2025-12-04T12:46:35.4870107Z * [new branch] gh/aorenste/159/head -> origin/gh/aorenste/159/head 2025-12-04T12:46:35.4870331Z * [new branch] gh/aorenste/159/orig -> origin/gh/aorenste/159/orig 2025-12-04T12:46:35.4870523Z * [new branch] gh/avikchaudhuri/1/base -> origin/gh/avikchaudhuri/1/base 2025-12-04T12:46:35.4870718Z * [new branch] gh/avikchaudhuri/1/head -> origin/gh/avikchaudhuri/1/head 2025-12-04T12:46:35.4870907Z * [new branch] gh/avikchaudhuri/2/base -> origin/gh/avikchaudhuri/2/base 2025-12-04T12:46:35.4871139Z * [new branch] gh/avikchaudhuri/2/head -> origin/gh/avikchaudhuri/2/head 2025-12-04T12:46:35.4871332Z * [new branch] gh/avikchaudhuri/2/orig -> origin/gh/avikchaudhuri/2/orig 2025-12-04T12:46:35.4871515Z * [new branch] gh/bdhirsh/666/base -> origin/gh/bdhirsh/666/base 2025-12-04T12:46:35.4871695Z * [new branch] gh/bdhirsh/666/head -> origin/gh/bdhirsh/666/head 2025-12-04T12:46:35.4871872Z * [new branch] gh/bdhirsh/666/orig -> origin/gh/bdhirsh/666/orig 2025-12-04T12:46:35.4872048Z * [new branch] gh/bdhirsh/668/base -> origin/gh/bdhirsh/668/base 2025-12-04T12:46:35.4872223Z * [new branch] gh/bdhirsh/668/head -> origin/gh/bdhirsh/668/head 2025-12-04T12:46:35.4872400Z * 
[new branch] gh/bdhirsh/668/orig -> origin/gh/bdhirsh/668/orig 2025-12-04T12:46:35.4872574Z * [new branch] gh/bdhirsh/669/base -> origin/gh/bdhirsh/669/base 2025-12-04T12:46:35.4872752Z * [new branch] gh/bdhirsh/669/head -> origin/gh/bdhirsh/669/head 2025-12-04T12:46:35.4872927Z * [new branch] gh/bdhirsh/669/orig -> origin/gh/bdhirsh/669/orig 2025-12-04T12:46:35.4873102Z * [new branch] gh/bdhirsh/670/base -> origin/gh/bdhirsh/670/base 2025-12-04T12:46:35.4873276Z * [new branch] gh/bdhirsh/670/head -> origin/gh/bdhirsh/670/head 2025-12-04T12:46:35.4873449Z * [new branch] gh/bdhirsh/670/orig -> origin/gh/bdhirsh/670/orig 2025-12-04T12:46:35.4873625Z * [new branch] gh/bdhirsh/672/base -> origin/gh/bdhirsh/672/base 2025-12-04T12:46:35.4873800Z * [new branch] gh/bdhirsh/672/head -> origin/gh/bdhirsh/672/head 2025-12-04T12:46:35.4873973Z * [new branch] gh/bdhirsh/672/orig -> origin/gh/bdhirsh/672/orig 2025-12-04T12:46:35.4874150Z * [new branch] gh/bdhirsh/675/base -> origin/gh/bdhirsh/675/base 2025-12-04T12:46:35.4874326Z * [new branch] gh/bdhirsh/675/head -> origin/gh/bdhirsh/675/head 2025-12-04T12:46:35.4874499Z * [new branch] gh/bdhirsh/675/orig -> origin/gh/bdhirsh/675/orig 2025-12-04T12:46:35.4874675Z * [new branch] gh/bdhirsh/676/base -> origin/gh/bdhirsh/676/base 2025-12-04T12:46:35.4874850Z * [new branch] gh/bdhirsh/676/head -> origin/gh/bdhirsh/676/head 2025-12-04T12:46:35.4875025Z * [new branch] gh/bdhirsh/676/orig -> origin/gh/bdhirsh/676/orig 2025-12-04T12:46:35.4875202Z * [new branch] gh/bdhirsh/677/base -> origin/gh/bdhirsh/677/base 2025-12-04T12:46:35.4875271Z * [new branch] gh/bdhirsh/677/head -> origin/gh/bdhirsh/677/head 2025-12-04T12:46:35.4875340Z * [new branch] gh/bdhirsh/677/orig -> origin/gh/bdhirsh/677/orig 2025-12-04T12:46:35.4875408Z * [new branch] gh/bdhirsh/678/base -> origin/gh/bdhirsh/678/base 2025-12-04T12:46:35.4875478Z * [new branch] gh/bdhirsh/678/head -> origin/gh/bdhirsh/678/head 2025-12-04T12:46:35.4875548Z * [new branch] 
gh/bdhirsh/678/orig -> origin/gh/bdhirsh/678/orig 2025-12-04T12:46:35.4875617Z * [new branch] gh/bdhirsh/679/base -> origin/gh/bdhirsh/679/base 2025-12-04T12:46:35.4875685Z * [new branch] gh/bdhirsh/679/head -> origin/gh/bdhirsh/679/head 2025-12-04T12:46:35.4875755Z * [new branch] gh/bdhirsh/679/orig -> origin/gh/bdhirsh/679/orig 2025-12-04T12:46:35.4875847Z * [new branch] gh/bdhirsh/680/base -> origin/gh/bdhirsh/680/base 2025-12-04T12:46:35.4875916Z * [new branch] gh/bdhirsh/680/head -> origin/gh/bdhirsh/680/head 2025-12-04T12:46:35.4875985Z * [new branch] gh/bdhirsh/680/orig -> origin/gh/bdhirsh/680/orig 2025-12-04T12:46:35.4876053Z * [new branch] gh/bdhirsh/681/base -> origin/gh/bdhirsh/681/base 2025-12-04T12:46:35.4876144Z * [new branch] gh/bdhirsh/681/head -> origin/gh/bdhirsh/681/head 2025-12-04T12:46:35.4876213Z * [new branch] gh/bdhirsh/681/orig -> origin/gh/bdhirsh/681/orig 2025-12-04T12:46:35.4876303Z * [new branch] gh/benjaminglass1/101/base -> origin/gh/benjaminglass1/101/base 2025-12-04T12:46:35.4876391Z * [new branch] gh/benjaminglass1/101/head -> origin/gh/benjaminglass1/101/head 2025-12-04T12:46:35.4876479Z * [new branch] gh/benjaminglass1/101/orig -> origin/gh/benjaminglass1/101/orig 2025-12-04T12:46:35.4876566Z * [new branch] gh/benjaminglass1/102/base -> origin/gh/benjaminglass1/102/base 2025-12-04T12:46:35.4876652Z * [new branch] gh/benjaminglass1/102/head -> origin/gh/benjaminglass1/102/head 2025-12-04T12:46:35.4876736Z * [new branch] gh/benjaminglass1/102/orig -> origin/gh/benjaminglass1/102/orig 2025-12-04T12:46:35.4876822Z * [new branch] gh/benjaminglass1/106/base -> origin/gh/benjaminglass1/106/base 2025-12-04T12:46:35.4876908Z * [new branch] gh/benjaminglass1/106/head -> origin/gh/benjaminglass1/106/head 2025-12-04T12:46:35.4876992Z * [new branch] gh/benjaminglass1/106/orig -> origin/gh/benjaminglass1/106/orig 2025-12-04T12:46:35.4877075Z * [new branch] gh/benjaminglass1/107/base -> origin/gh/benjaminglass1/107/base 
2025-12-04T12:46:35.4877161Z * [new branch] gh/benjaminglass1/107/head -> origin/gh/benjaminglass1/107/head 2025-12-04T12:46:35.4877246Z * [new branch] gh/benjaminglass1/107/orig -> origin/gh/benjaminglass1/107/orig 2025-12-04T12:46:35.4877330Z * [new branch] gh/benjaminglass1/108/base -> origin/gh/benjaminglass1/108/base 2025-12-04T12:46:35.4877418Z * [new branch] gh/benjaminglass1/108/head -> origin/gh/benjaminglass1/108/head 2025-12-04T12:46:35.4877543Z * [new branch] gh/benjaminglass1/108/orig -> origin/gh/benjaminglass1/108/orig 2025-12-04T12:46:35.4877629Z * [new branch] gh/benjaminglass1/109/base -> origin/gh/benjaminglass1/109/base 2025-12-04T12:46:35.4877715Z * [new branch] gh/benjaminglass1/109/head -> origin/gh/benjaminglass1/109/head 2025-12-04T12:46:35.4877798Z * [new branch] gh/benjaminglass1/109/orig -> origin/gh/benjaminglass1/109/orig 2025-12-04T12:46:35.4877883Z * [new branch] gh/benjaminglass1/97/base -> origin/gh/benjaminglass1/97/base 2025-12-04T12:46:35.4877968Z * [new branch] gh/benjaminglass1/97/head -> origin/gh/benjaminglass1/97/head 2025-12-04T12:46:35.4878050Z * [new branch] gh/benjaminglass1/97/orig -> origin/gh/benjaminglass1/97/orig 2025-12-04T12:46:35.4878129Z * [new branch] gh/bobrenjc93/570/base -> origin/gh/bobrenjc93/570/base 2025-12-04T12:46:35.4878206Z * [new branch] gh/bobrenjc93/570/head -> origin/gh/bobrenjc93/570/head 2025-12-04T12:46:35.4878282Z * [new branch] gh/bobrenjc93/570/orig -> origin/gh/bobrenjc93/570/orig 2025-12-04T12:46:35.4878355Z * [new branch] gh/bobrenjc93/604/base -> origin/gh/bobrenjc93/604/base 2025-12-04T12:46:35.4878428Z * [new branch] gh/bobrenjc93/604/head -> origin/gh/bobrenjc93/604/head 2025-12-04T12:46:35.4878500Z * [new branch] gh/bobrenjc93/604/orig -> origin/gh/bobrenjc93/604/orig 2025-12-04T12:46:35.4878573Z * [new branch] gh/bobrenjc93/638/base -> origin/gh/bobrenjc93/638/base 2025-12-04T12:46:35.4878683Z * [new branch] gh/bobrenjc93/638/head -> origin/gh/bobrenjc93/638/head 
2025-12-04T12:46:35.4878755Z * [new branch] gh/bobrenjc93/638/orig -> origin/gh/bobrenjc93/638/orig 2025-12-04T12:46:35.4878828Z * [new branch] gh/bobrenjc93/653/base -> origin/gh/bobrenjc93/653/base 2025-12-04T12:46:35.4878901Z * [new branch] gh/bobrenjc93/653/head -> origin/gh/bobrenjc93/653/head 2025-12-04T12:46:35.4879007Z * [new branch] gh/bobrenjc93/653/orig -> origin/gh/bobrenjc93/653/orig 2025-12-04T12:46:35.4879084Z * [new branch] gh/bobrenjc93/654/base -> origin/gh/bobrenjc93/654/base 2025-12-04T12:46:35.4879156Z * [new branch] gh/bobrenjc93/654/head -> origin/gh/bobrenjc93/654/head 2025-12-04T12:46:35.4879228Z * [new branch] gh/bobrenjc93/654/orig -> origin/gh/bobrenjc93/654/orig 2025-12-04T12:46:35.4879301Z * [new branch] gh/bobrenjc93/657/base -> origin/gh/bobrenjc93/657/base 2025-12-04T12:46:35.4879374Z * [new branch] gh/bobrenjc93/657/head -> origin/gh/bobrenjc93/657/head 2025-12-04T12:46:35.4879445Z * [new branch] gh/bobrenjc93/657/orig -> origin/gh/bobrenjc93/657/orig 2025-12-04T12:46:35.4879519Z * [new branch] gh/bobrenjc93/672/base -> origin/gh/bobrenjc93/672/base 2025-12-04T12:46:35.4879593Z * [new branch] gh/bobrenjc93/672/head -> origin/gh/bobrenjc93/672/head 2025-12-04T12:46:35.4879666Z * [new branch] gh/bobrenjc93/672/orig -> origin/gh/bobrenjc93/672/orig 2025-12-04T12:46:35.4879738Z * [new branch] gh/bobrenjc93/679/base -> origin/gh/bobrenjc93/679/base 2025-12-04T12:46:35.4879812Z * [new branch] gh/bobrenjc93/679/head -> origin/gh/bobrenjc93/679/head 2025-12-04T12:46:35.4879886Z * [new branch] gh/bobrenjc93/679/orig -> origin/gh/bobrenjc93/679/orig 2025-12-04T12:46:35.4879959Z * [new branch] gh/bobrenjc93/680/base -> origin/gh/bobrenjc93/680/base 2025-12-04T12:46:35.4880032Z * [new branch] gh/bobrenjc93/680/head -> origin/gh/bobrenjc93/680/head 2025-12-04T12:46:35.4880106Z * [new branch] gh/bobrenjc93/680/orig -> origin/gh/bobrenjc93/680/orig 2025-12-04T12:46:35.4880177Z * [new branch] gh/bobrenjc93/681/base -> origin/gh/bobrenjc93/681/base 
2025-12-04T12:46:35.4880250Z * [new branch] gh/bobrenjc93/681/head -> origin/gh/bobrenjc93/681/head 2025-12-04T12:46:35.4880325Z * [new branch] gh/bobrenjc93/681/orig -> origin/gh/bobrenjc93/681/orig 2025-12-04T12:46:35.4880397Z * [new branch] gh/bobrenjc93/682/base -> origin/gh/bobrenjc93/682/base 2025-12-04T12:46:35.4880468Z * [new branch] gh/bobrenjc93/682/head -> origin/gh/bobrenjc93/682/head 2025-12-04T12:46:35.4880542Z * [new branch] gh/bobrenjc93/682/orig -> origin/gh/bobrenjc93/682/orig 2025-12-04T12:46:35.4880615Z * [new branch] gh/bobrenjc93/683/base -> origin/gh/bobrenjc93/683/base 2025-12-04T12:46:35.4880688Z * [new branch] gh/bobrenjc93/683/head -> origin/gh/bobrenjc93/683/head 2025-12-04T12:46:35.4880763Z * [new branch] gh/bobrenjc93/683/orig -> origin/gh/bobrenjc93/683/orig 2025-12-04T12:46:35.4880834Z * [new branch] gh/bobrenjc93/684/base -> origin/gh/bobrenjc93/684/base 2025-12-04T12:46:35.4880908Z * [new branch] gh/bobrenjc93/684/head -> origin/gh/bobrenjc93/684/head 2025-12-04T12:46:35.4880982Z * [new branch] gh/bobrenjc93/684/orig -> origin/gh/bobrenjc93/684/orig 2025-12-04T12:46:35.4881053Z * [new branch] gh/bobrenjc93/685/base -> origin/gh/bobrenjc93/685/base 2025-12-04T12:46:35.4881124Z * [new branch] gh/bobrenjc93/685/head -> origin/gh/bobrenjc93/685/head 2025-12-04T12:46:35.4881197Z * [new branch] gh/bobrenjc93/685/orig -> origin/gh/bobrenjc93/685/orig 2025-12-04T12:46:35.4881299Z * [new branch] gh/bobrenjc93/686/base -> origin/gh/bobrenjc93/686/base 2025-12-04T12:46:35.4881373Z * [new branch] gh/bobrenjc93/686/head -> origin/gh/bobrenjc93/686/head 2025-12-04T12:46:35.4881446Z * [new branch] gh/bobrenjc93/686/orig -> origin/gh/bobrenjc93/686/orig 2025-12-04T12:46:35.4881519Z * [new branch] gh/bobrenjc93/687/base -> origin/gh/bobrenjc93/687/base 2025-12-04T12:46:35.4881620Z * [new branch] gh/bobrenjc93/687/head -> origin/gh/bobrenjc93/687/head 2025-12-04T12:46:35.4881692Z * [new branch] gh/bobrenjc93/687/orig -> origin/gh/bobrenjc93/687/orig 
2025-12-04T12:46:35.4881763Z * [new branch] gh/bobrenjc93/688/base -> origin/gh/bobrenjc93/688/base 2025-12-04T12:46:35.4881836Z * [new branch] gh/bobrenjc93/688/head -> origin/gh/bobrenjc93/688/head 2025-12-04T12:46:35.4881909Z * [new branch] gh/bobrenjc93/688/orig -> origin/gh/bobrenjc93/688/orig 2025-12-04T12:46:35.4881981Z * [new branch] gh/bobrenjc93/689/base -> origin/gh/bobrenjc93/689/base 2025-12-04T12:46:35.4882054Z * [new branch] gh/bobrenjc93/689/head -> origin/gh/bobrenjc93/689/head 2025-12-04T12:46:35.4882126Z * [new branch] gh/bobrenjc93/689/orig -> origin/gh/bobrenjc93/689/orig 2025-12-04T12:46:35.4882199Z * [new branch] gh/bobrenjc93/690/base -> origin/gh/bobrenjc93/690/base 2025-12-04T12:46:35.4882274Z * [new branch] gh/bobrenjc93/690/head -> origin/gh/bobrenjc93/690/head 2025-12-04T12:46:35.4882347Z * [new branch] gh/bobrenjc93/690/orig -> origin/gh/bobrenjc93/690/orig 2025-12-04T12:46:35.4882418Z * [new branch] gh/bobrenjc93/691/base -> origin/gh/bobrenjc93/691/base 2025-12-04T12:46:35.4882491Z * [new branch] gh/bobrenjc93/691/head -> origin/gh/bobrenjc93/691/head 2025-12-04T12:46:35.4882563Z * [new branch] gh/bobrenjc93/691/orig -> origin/gh/bobrenjc93/691/orig 2025-12-04T12:46:35.4882635Z * [new branch] gh/bobrenjc93/692/base -> origin/gh/bobrenjc93/692/base 2025-12-04T12:46:35.4882708Z * [new branch] gh/bobrenjc93/692/head -> origin/gh/bobrenjc93/692/head 2025-12-04T12:46:35.4882780Z * [new branch] gh/bobrenjc93/692/orig -> origin/gh/bobrenjc93/692/orig 2025-12-04T12:46:35.4882854Z * [new branch] gh/bobrenjc93/693/base -> origin/gh/bobrenjc93/693/base 2025-12-04T12:46:35.4882926Z * [new branch] gh/bobrenjc93/693/head -> origin/gh/bobrenjc93/693/head 2025-12-04T12:46:35.4882997Z * [new branch] gh/bobrenjc93/693/orig -> origin/gh/bobrenjc93/693/orig 2025-12-04T12:46:35.4883072Z * [new branch] gh/bobrenjc93/694/base -> origin/gh/bobrenjc93/694/base 2025-12-04T12:46:35.4883145Z * [new branch] gh/bobrenjc93/694/head -> origin/gh/bobrenjc93/694/head 
2025-12-04T12:46:35.4883217Z * [new branch] gh/bobrenjc93/694/orig -> origin/gh/bobrenjc93/694/orig 2025-12-04T12:46:35.4883291Z * [new branch] gh/bobrenjc93/695/base -> origin/gh/bobrenjc93/695/base 2025-12-04T12:46:35.4883362Z * [new branch] gh/bobrenjc93/695/head -> origin/gh/bobrenjc93/695/head 2025-12-04T12:46:35.4883433Z * [new branch] gh/bobrenjc93/695/orig -> origin/gh/bobrenjc93/695/orig 2025-12-04T12:46:35.4883504Z * [new branch] gh/c00w/23/base -> origin/gh/c00w/23/base 2025-12-04T12:46:35.4883569Z * [new branch] gh/c00w/23/head -> origin/gh/c00w/23/head 2025-12-04T12:46:35.4883632Z * [new branch] gh/c00w/53/base -> origin/gh/c00w/53/base 2025-12-04T12:46:35.4883696Z * [new branch] gh/c00w/53/head -> origin/gh/c00w/53/head 2025-12-04T12:46:35.4883759Z * [new branch] gh/c00w/53/orig -> origin/gh/c00w/53/orig 2025-12-04T12:46:35.4883843Z * [new branch] gh/c00w/54/base -> origin/gh/c00w/54/base 2025-12-04T12:46:35.4883907Z * [new branch] gh/c00w/54/head -> origin/gh/c00w/54/head 2025-12-04T12:46:35.4883969Z * [new branch] gh/c00w/54/orig -> origin/gh/c00w/54/orig 2025-12-04T12:46:35.4884030Z * [new branch] gh/c00w/56/base -> origin/gh/c00w/56/base 2025-12-04T12:46:35.4884118Z * [new branch] gh/c00w/56/head -> origin/gh/c00w/56/head 2025-12-04T12:46:35.4884179Z * [new branch] gh/c00w/56/orig -> origin/gh/c00w/56/orig 2025-12-04T12:46:35.4884241Z * [new branch] gh/c00w/57/base -> origin/gh/c00w/57/base 2025-12-04T12:46:35.4884305Z * [new branch] gh/c00w/57/head -> origin/gh/c00w/57/head 2025-12-04T12:46:35.4884367Z * [new branch] gh/c00w/57/orig -> origin/gh/c00w/57/orig 2025-12-04T12:46:35.4884431Z * [new branch] gh/c00w/58/base -> origin/gh/c00w/58/base 2025-12-04T12:46:35.4884494Z * [new branch] gh/c00w/58/head -> origin/gh/c00w/58/head 2025-12-04T12:46:35.4884556Z * [new branch] gh/c00w/58/orig -> origin/gh/c00w/58/orig 2025-12-04T12:46:35.4884628Z * [new branch] gh/clee2000/1/base -> origin/gh/clee2000/1/base 2025-12-04T12:46:35.4884698Z * [new branch] 
gh/clee2000/1/head -> origin/gh/clee2000/1/head 2025-12-04T12:46:35.4884765Z * [new branch] gh/clee2000/1/orig -> origin/gh/clee2000/1/orig 2025-12-04T12:46:35.4884844Z * [new branch] gh/coconutruben/1/base -> origin/gh/coconutruben/1/base 2025-12-04T12:46:35.4884918Z * [new branch] gh/coconutruben/1/head -> origin/gh/coconutruben/1/head 2025-12-04T12:46:35.4884996Z * [new branch] gh/coconutruben/55/base -> origin/gh/coconutruben/55/base 2025-12-04T12:46:35.4885073Z * [new branch] gh/coconutruben/55/head -> origin/gh/coconutruben/55/head 2025-12-04T12:46:35.4885150Z * [new branch] gh/coconutruben/55/orig -> origin/gh/coconutruben/55/orig 2025-12-04T12:46:35.4885225Z * [new branch] gh/coconutruben/57/base -> origin/gh/coconutruben/57/base 2025-12-04T12:46:35.4885300Z * [new branch] gh/coconutruben/57/head -> origin/gh/coconutruben/57/head 2025-12-04T12:46:35.4885376Z * [new branch] gh/coconutruben/57/orig -> origin/gh/coconutruben/57/orig 2025-12-04T12:46:35.4885449Z * [new branch] gh/coconutruben/70/base -> origin/gh/coconutruben/70/base 2025-12-04T12:46:35.4885524Z * [new branch] gh/coconutruben/70/head -> origin/gh/coconutruben/70/head 2025-12-04T12:46:35.4885599Z * [new branch] gh/coconutruben/70/orig -> origin/gh/coconutruben/70/orig 2025-12-04T12:46:35.4885673Z * [new branch] gh/coconutruben/71/base -> origin/gh/coconutruben/71/base 2025-12-04T12:46:35.4885749Z * [new branch] gh/coconutruben/71/head -> origin/gh/coconutruben/71/head 2025-12-04T12:46:35.4885823Z * [new branch] gh/coconutruben/71/orig -> origin/gh/coconutruben/71/orig 2025-12-04T12:46:35.4885897Z * [new branch] gh/coconutruben/72/base -> origin/gh/coconutruben/72/base 2025-12-04T12:46:35.4885973Z * [new branch] gh/coconutruben/72/head -> origin/gh/coconutruben/72/head 2025-12-04T12:46:35.4886048Z * [new branch] gh/coconutruben/72/orig -> origin/gh/coconutruben/72/orig 2025-12-04T12:46:35.4886123Z * [new branch] gh/coconutruben/73/base -> origin/gh/coconutruben/73/base 
2025-12-04T12:46:35.4886197Z * [new branch] gh/coconutruben/73/head -> origin/gh/coconutruben/73/head 2025-12-04T12:46:35.4886271Z * [new branch] gh/coconutruben/73/orig -> origin/gh/coconutruben/73/orig 2025-12-04T12:46:35.4886345Z * [new branch] gh/coconutruben/74/base -> origin/gh/coconutruben/74/base 2025-12-04T12:46:35.4886441Z * [new branch] gh/coconutruben/74/head -> origin/gh/coconutruben/74/head 2025-12-04T12:46:35.4886515Z * [new branch] gh/coconutruben/74/orig -> origin/gh/coconutruben/74/orig 2025-12-04T12:46:35.4886590Z * [new branch] gh/coconutruben/79/base -> origin/gh/coconutruben/79/base 2025-12-04T12:46:35.4886664Z * [new branch] gh/coconutruben/79/head -> origin/gh/coconutruben/79/head 2025-12-04T12:46:35.4886765Z * [new branch] gh/coconutruben/79/orig -> origin/gh/coconutruben/79/orig 2025-12-04T12:46:35.4886840Z * [new branch] gh/coconutruben/80/base -> origin/gh/coconutruben/80/base 2025-12-04T12:46:35.4886915Z * [new branch] gh/coconutruben/80/head -> origin/gh/coconutruben/80/head 2025-12-04T12:46:35.4886989Z * [new branch] gh/coconutruben/80/orig -> origin/gh/coconutruben/80/orig 2025-12-04T12:46:35.4887066Z * [new branch] gh/coconutruben/82/base -> origin/gh/coconutruben/82/base 2025-12-04T12:46:35.4887142Z * [new branch] gh/coconutruben/82/head -> origin/gh/coconutruben/82/head 2025-12-04T12:46:35.4887217Z * [new branch] gh/coconutruben/82/orig -> origin/gh/coconutruben/82/orig 2025-12-04T12:46:35.4887293Z * [new branch] gh/coconutruben/83/base -> origin/gh/coconutruben/83/base 2025-12-04T12:46:35.4887369Z * [new branch] gh/coconutruben/83/head -> origin/gh/coconutruben/83/head 2025-12-04T12:46:35.4887444Z * [new branch] gh/coconutruben/83/orig -> origin/gh/coconutruben/83/orig 2025-12-04T12:46:35.4887563Z * [new branch] gh/coconutruben/84/base -> origin/gh/coconutruben/84/base 2025-12-04T12:46:35.4887640Z * [new branch] gh/coconutruben/84/head -> origin/gh/coconutruben/84/head 2025-12-04T12:46:35.4887716Z * [new branch] 
gh/coconutruben/84/orig -> origin/gh/coconutruben/84/orig 2025-12-04T12:46:35.4887793Z * [new branch] gh/coconutruben/85/base -> origin/gh/coconutruben/85/base 2025-12-04T12:46:35.4887867Z * [new branch] gh/coconutruben/85/head -> origin/gh/coconutruben/85/head 2025-12-04T12:46:35.4887945Z * [new branch] gh/coconutruben/85/orig -> origin/gh/coconutruben/85/orig 2025-12-04T12:46:35.4888019Z * [new branch] gh/coconutruben/86/base -> origin/gh/coconutruben/86/base 2025-12-04T12:46:35.4888095Z * [new branch] gh/coconutruben/86/head -> origin/gh/coconutruben/86/head 2025-12-04T12:46:35.4888171Z * [new branch] gh/coconutruben/86/orig -> origin/gh/coconutruben/86/orig 2025-12-04T12:46:35.4888247Z * [new branch] gh/colinchan15/1/base -> origin/gh/colinchan15/1/base 2025-12-04T12:46:35.4888320Z * [new branch] gh/colinchan15/1/head -> origin/gh/colinchan15/1/head 2025-12-04T12:46:35.4888394Z * [new branch] gh/colinchan15/2/base -> origin/gh/colinchan15/2/base 2025-12-04T12:46:35.4888467Z * [new branch] gh/colinchan15/2/head -> origin/gh/colinchan15/2/head 2025-12-04T12:46:35.4888538Z * [new branch] gh/colinchan15/3/base -> origin/gh/colinchan15/3/base 2025-12-04T12:46:35.4888610Z * [new branch] gh/colinchan15/3/head -> origin/gh/colinchan15/3/head 2025-12-04T12:46:35.4888681Z * [new branch] gh/colinchan15/6/base -> origin/gh/colinchan15/6/base 2025-12-04T12:46:35.4888754Z * [new branch] gh/colinchan15/6/head -> origin/gh/colinchan15/6/head 2025-12-04T12:46:35.4888820Z * [new branch] gh/d4l3k/1/base -> origin/gh/d4l3k/1/base 2025-12-04T12:46:35.4888884Z * [new branch] gh/d4l3k/1/head -> origin/gh/d4l3k/1/head 2025-12-04T12:46:35.4888947Z * [new branch] gh/d4l3k/2/base -> origin/gh/d4l3k/2/base 2025-12-04T12:46:35.4889011Z * [new branch] gh/d4l3k/2/head -> origin/gh/d4l3k/2/head 2025-12-04T12:46:35.4889130Z * [new branch] gh/d4l3k/2/orig -> origin/gh/d4l3k/2/orig 2025-12-04T12:46:35.4889192Z * [new branch] gh/d4l3k/3/base -> origin/gh/d4l3k/3/base 2025-12-04T12:46:35.4889255Z 
* [new branch] gh/d4l3k/3/head -> origin/gh/d4l3k/3/head 2025-12-04T12:46:35.4889317Z * [new branch] gh/d4l3k/3/orig -> origin/gh/d4l3k/3/orig 2025-12-04T12:46:35.4889417Z * [new branch] gh/d4l3k/4/base -> origin/gh/d4l3k/4/base 2025-12-04T12:46:35.4889480Z * [new branch] gh/d4l3k/4/head -> origin/gh/d4l3k/4/head 2025-12-04T12:46:35.4889542Z * [new branch] gh/d4l3k/4/orig -> origin/gh/d4l3k/4/orig 2025-12-04T12:46:35.4889604Z * [new branch] gh/d4l3k/5/base -> origin/gh/d4l3k/5/base 2025-12-04T12:46:35.4889666Z * [new branch] gh/d4l3k/5/orig -> origin/gh/d4l3k/5/orig 2025-12-04T12:46:35.4889753Z * [new branch] gh/davidberard98/392/base -> origin/gh/davidberard98/392/base 2025-12-04T12:46:35.4889838Z * [new branch] gh/davidberard98/392/head -> origin/gh/davidberard98/392/head 2025-12-04T12:46:35.4889920Z * [new branch] gh/davidberard98/392/orig -> origin/gh/davidberard98/392/orig 2025-12-04T12:46:35.4890000Z * [new branch] gh/davidberard98/399/base -> origin/gh/davidberard98/399/base 2025-12-04T12:46:35.4890083Z * [new branch] gh/davidberard98/399/head -> origin/gh/davidberard98/399/head 2025-12-04T12:46:35.4890163Z * [new branch] gh/davidberard98/399/orig -> origin/gh/davidberard98/399/orig 2025-12-04T12:46:35.4890238Z * [new branch] gh/desertfire/605/base -> origin/gh/desertfire/605/base 2025-12-04T12:46:35.4890313Z * [new branch] gh/desertfire/605/head -> origin/gh/desertfire/605/head 2025-12-04T12:46:35.4890386Z * [new branch] gh/desertfire/605/orig -> origin/gh/desertfire/605/orig 2025-12-04T12:46:35.4890461Z * [new branch] gh/desertfire/606/base -> origin/gh/desertfire/606/base 2025-12-04T12:46:35.4890536Z * [new branch] gh/desertfire/606/head -> origin/gh/desertfire/606/head 2025-12-04T12:46:35.4890608Z * [new branch] gh/desertfire/606/orig -> origin/gh/desertfire/606/orig 2025-12-04T12:46:35.4890683Z * [new branch] gh/desertfire/607/base -> origin/gh/desertfire/607/base 2025-12-04T12:46:35.4890760Z * [new branch] gh/desertfire/607/head -> 
origin/gh/desertfire/607/head 2025-12-04T12:46:35.4890833Z * [new branch] gh/desertfire/607/orig -> origin/gh/desertfire/607/orig 2025-12-04T12:46:35.4890906Z * [new branch] gh/desertfire/608/base -> origin/gh/desertfire/608/base 2025-12-04T12:46:35.4890977Z * [new branch] gh/desertfire/608/head -> origin/gh/desertfire/608/head 2025-12-04T12:46:35.4891048Z * [new branch] gh/desertfire/608/orig -> origin/gh/desertfire/608/orig 2025-12-04T12:46:35.4891122Z * [new branch] gh/desertfire/609/base -> origin/gh/desertfire/609/base 2025-12-04T12:46:35.4891195Z * [new branch] gh/desertfire/609/head -> origin/gh/desertfire/609/head 2025-12-04T12:46:35.4891268Z * [new branch] gh/desertfire/609/orig -> origin/gh/desertfire/609/orig 2025-12-04T12:46:35.4891344Z * [new branch] gh/desertfire/610/base -> origin/gh/desertfire/610/base 2025-12-04T12:46:35.4891416Z * [new branch] gh/desertfire/610/head -> origin/gh/desertfire/610/head 2025-12-04T12:46:35.4891489Z * [new branch] gh/desertfire/610/orig -> origin/gh/desertfire/610/orig 2025-12-04T12:46:35.4891563Z * [new branch] gh/desertfire/611/base -> origin/gh/desertfire/611/base 2025-12-04T12:46:35.4891636Z * [new branch] gh/desertfire/611/head -> origin/gh/desertfire/611/head 2025-12-04T12:46:35.4891738Z * [new branch] gh/desertfire/611/orig -> origin/gh/desertfire/611/orig 2025-12-04T12:46:35.4891814Z * [new branch] gh/desertfire/612/base -> origin/gh/desertfire/612/base 2025-12-04T12:46:35.4891886Z * [new branch] gh/desertfire/612/head -> origin/gh/desertfire/612/head 2025-12-04T12:46:35.4891959Z * [new branch] gh/desertfire/612/orig -> origin/gh/desertfire/612/orig 2025-12-04T12:46:35.4892062Z * [new branch] gh/desertfire/613/base -> origin/gh/desertfire/613/base 2025-12-04T12:46:35.4892136Z * [new branch] gh/desertfire/613/head -> origin/gh/desertfire/613/head 2025-12-04T12:46:35.4892209Z * [new branch] gh/desertfire/613/orig -> origin/gh/desertfire/613/orig 2025-12-04T12:46:35.4892285Z * [new branch] gh/desertfire/614/base -> 
origin/gh/desertfire/614/base 2025-12-04T12:46:35.4892358Z * [new branch] gh/desertfire/614/head -> origin/gh/desertfire/614/head 2025-12-04T12:46:35.4892434Z * [new branch] gh/desertfire/614/orig -> origin/gh/desertfire/614/orig 2025-12-04T12:46:35.4892509Z * [new branch] gh/desertfire/615/base -> origin/gh/desertfire/615/base 2025-12-04T12:46:35.4892582Z * [new branch] gh/desertfire/615/head -> origin/gh/desertfire/615/head 2025-12-04T12:46:35.4892656Z * [new branch] gh/desertfire/615/orig -> origin/gh/desertfire/615/orig 2025-12-04T12:46:35.4892731Z * [new branch] gh/desertfire/616/base -> origin/gh/desertfire/616/base 2025-12-04T12:46:35.4892803Z * [new branch] gh/desertfire/616/head -> origin/gh/desertfire/616/head 2025-12-04T12:46:35.4892878Z * [new branch] gh/desertfire/616/orig -> origin/gh/desertfire/616/orig 2025-12-04T12:46:35.4892950Z * [new branch] gh/desertfire/617/base -> origin/gh/desertfire/617/base 2025-12-04T12:46:35.4893022Z * [new branch] gh/desertfire/617/head -> origin/gh/desertfire/617/head 2025-12-04T12:46:35.4893098Z * [new branch] gh/desertfire/617/orig -> origin/gh/desertfire/617/orig 2025-12-04T12:46:35.4893169Z * [new branch] gh/dharakk/1/base -> origin/gh/dharakk/1/base 2025-12-04T12:46:35.4893239Z * [new branch] gh/dharakk/1/head -> origin/gh/dharakk/1/head 2025-12-04T12:46:35.4893314Z * [new branch] gh/drisspg/170/base -> origin/gh/drisspg/170/base 2025-12-04T12:46:35.4893387Z * [new branch] gh/drisspg/170/head -> origin/gh/drisspg/170/head 2025-12-04T12:46:35.4893458Z * [new branch] gh/drisspg/170/orig -> origin/gh/drisspg/170/orig 2025-12-04T12:46:35.4893529Z * [new branch] gh/drisspg/182/base -> origin/gh/drisspg/182/base 2025-12-04T12:46:35.4893597Z * [new branch] gh/drisspg/182/head -> origin/gh/drisspg/182/head 2025-12-04T12:46:35.4893665Z * [new branch] gh/drisspg/183/base -> origin/gh/drisspg/183/base 2025-12-04T12:46:35.4893737Z * [new branch] gh/drisspg/183/head -> origin/gh/drisspg/183/head 2025-12-04T12:46:35.4893807Z * 
[new branch] gh/drisspg/184/base -> origin/gh/drisspg/184/base 2025-12-04T12:46:35.4893874Z * [new branch] gh/drisspg/184/head -> origin/gh/drisspg/184/head 2025-12-04T12:46:35.4893946Z * [new branch] gh/drisspg/185/base -> origin/gh/drisspg/185/base 2025-12-04T12:46:35.4894016Z * [new branch] gh/drisspg/185/head -> origin/gh/drisspg/185/head 2025-12-04T12:46:35.4894087Z * [new branch] gh/drisspg/194/base -> origin/gh/drisspg/194/base 2025-12-04T12:46:35.4894156Z * [new branch] gh/drisspg/194/head -> origin/gh/drisspg/194/head 2025-12-04T12:46:35.4894224Z * [new branch] gh/drisspg/194/orig -> origin/gh/drisspg/194/orig 2025-12-04T12:46:35.4894296Z * [new branch] gh/drisspg/200/base -> origin/gh/drisspg/200/base 2025-12-04T12:46:35.4894388Z * [new branch] gh/drisspg/200/head -> origin/gh/drisspg/200/head 2025-12-04T12:46:35.4894457Z * [new branch] gh/drisspg/200/orig -> origin/gh/drisspg/200/orig 2025-12-04T12:46:35.4894527Z * [new branch] gh/drisspg/218/base -> origin/gh/drisspg/218/base 2025-12-04T12:46:35.4894627Z * [new branch] gh/drisspg/218/head -> origin/gh/drisspg/218/head 2025-12-04T12:46:35.4894695Z * [new branch] gh/drisspg/218/orig -> origin/gh/drisspg/218/orig 2025-12-04T12:46:35.4894766Z * [new branch] gh/drisspg/219/base -> origin/gh/drisspg/219/base 2025-12-04T12:46:35.4894834Z * [new branch] gh/drisspg/219/head -> origin/gh/drisspg/219/head 2025-12-04T12:46:35.4894903Z * [new branch] gh/drisspg/219/orig -> origin/gh/drisspg/219/orig 2025-12-04T12:46:35.4894974Z * [new branch] gh/drisspg/220/base -> origin/gh/drisspg/220/base 2025-12-04T12:46:35.4895044Z * [new branch] gh/drisspg/220/head -> origin/gh/drisspg/220/head 2025-12-04T12:46:35.4895114Z * [new branch] gh/drisspg/220/orig -> origin/gh/drisspg/220/orig 2025-12-04T12:46:35.4895186Z * [new branch] gh/drisspg/221/base -> origin/gh/drisspg/221/base 2025-12-04T12:46:35.4895256Z * [new branch] gh/drisspg/221/head -> origin/gh/drisspg/221/head 2025-12-04T12:46:35.4895324Z * [new branch] 
gh/drisspg/221/orig -> origin/gh/drisspg/221/orig 2025-12-04T12:46:35.4895395Z * [new branch] gh/drisspg/222/base -> origin/gh/drisspg/222/base 2025-12-04T12:46:35.4895463Z * [new branch] gh/drisspg/222/head -> origin/gh/drisspg/222/head 2025-12-04T12:46:35.4895531Z * [new branch] gh/drisspg/222/orig -> origin/gh/drisspg/222/orig 2025-12-04T12:46:35.4895602Z * [new branch] gh/drisspg/223/base -> origin/gh/drisspg/223/base 2025-12-04T12:46:35.4895671Z * [new branch] gh/drisspg/223/head -> origin/gh/drisspg/223/head 2025-12-04T12:46:35.4895739Z * [new branch] gh/drisspg/223/orig -> origin/gh/drisspg/223/orig 2025-12-04T12:46:35.4895811Z * [new branch] gh/drisspg/224/base -> origin/gh/drisspg/224/base 2025-12-04T12:46:35.4895881Z * [new branch] gh/drisspg/224/head -> origin/gh/drisspg/224/head 2025-12-04T12:46:35.4895952Z * [new branch] gh/drisspg/224/orig -> origin/gh/drisspg/224/orig 2025-12-04T12:46:35.4896019Z * [new branch] gh/drisspg/225/base -> origin/gh/drisspg/225/base 2025-12-04T12:46:35.4896087Z * [new branch] gh/drisspg/225/head -> origin/gh/drisspg/225/head 2025-12-04T12:46:35.4896158Z * [new branch] gh/drisspg/225/orig -> origin/gh/drisspg/225/orig 2025-12-04T12:46:35.4896227Z * [new branch] gh/drisspg/226/base -> origin/gh/drisspg/226/base 2025-12-04T12:46:35.4896296Z * [new branch] gh/drisspg/226/head -> origin/gh/drisspg/226/head 2025-12-04T12:46:35.4896367Z * [new branch] gh/drisspg/226/orig -> origin/gh/drisspg/226/orig 2025-12-04T12:46:35.4896436Z * [new branch] gh/drisspg/227/base -> origin/gh/drisspg/227/base 2025-12-04T12:46:35.4896506Z * [new branch] gh/drisspg/227/head -> origin/gh/drisspg/227/head 2025-12-04T12:46:35.4896576Z * [new branch] gh/drisspg/227/orig -> origin/gh/drisspg/227/orig 2025-12-04T12:46:35.4896645Z * [new branch] gh/drisspg/228/base -> origin/gh/drisspg/228/base 2025-12-04T12:46:35.4896714Z * [new branch] gh/drisspg/228/head -> origin/gh/drisspg/228/head 2025-12-04T12:46:35.4896784Z * [new branch] gh/drisspg/228/orig -> 
origin/gh/drisspg/228/orig 2025-12-04T12:46:35.4896853Z * [new branch] gh/drisspg/229/base -> origin/gh/drisspg/229/base 2025-12-04T12:46:35.4896943Z * [new branch] gh/drisspg/229/head -> origin/gh/drisspg/229/head 2025-12-04T12:46:35.4897015Z * [new branch] gh/drisspg/229/orig -> origin/gh/drisspg/229/orig 2025-12-04T12:46:35.4897083Z * [new branch] gh/drisspg/230/base -> origin/gh/drisspg/230/base 2025-12-04T12:46:35.4897176Z * [new branch] gh/drisspg/230/head -> origin/gh/drisspg/230/head 2025-12-04T12:46:35.4897247Z * [new branch] gh/drisspg/230/orig -> origin/gh/drisspg/230/orig 2025-12-04T12:46:35.4897319Z * [new branch] gh/dsjohns2/1/base -> origin/gh/dsjohns2/1/base 2025-12-04T12:46:35.4897390Z * [new branch] gh/dsjohns2/1/head -> origin/gh/dsjohns2/1/head 2025-12-04T12:46:35.4897594Z * [new branch] gh/dzmitry-huba/1/base -> origin/gh/dzmitry-huba/1/base 2025-12-04T12:46:35.4897673Z * [new branch] gh/dzmitry-huba/1/head -> origin/gh/dzmitry-huba/1/head 2025-12-04T12:46:35.4897753Z * [new branch] gh/dzmitry-huba/12/base -> origin/gh/dzmitry-huba/12/base 2025-12-04T12:46:35.4897830Z * [new branch] gh/dzmitry-huba/12/head -> origin/gh/dzmitry-huba/12/head 2025-12-04T12:46:35.4897907Z * [new branch] gh/dzmitry-huba/12/orig -> origin/gh/dzmitry-huba/12/orig 2025-12-04T12:46:35.4897986Z * [new branch] gh/dzmitry-huba/13/base -> origin/gh/dzmitry-huba/13/base 2025-12-04T12:46:35.4898061Z * [new branch] gh/dzmitry-huba/13/head -> origin/gh/dzmitry-huba/13/head 2025-12-04T12:46:35.4898136Z * [new branch] gh/dzmitry-huba/13/orig -> origin/gh/dzmitry-huba/13/orig 2025-12-04T12:46:35.4898213Z * [new branch] gh/dzmitry-huba/14/base -> origin/gh/dzmitry-huba/14/base 2025-12-04T12:46:35.4898287Z * [new branch] gh/dzmitry-huba/14/head -> origin/gh/dzmitry-huba/14/head 2025-12-04T12:46:35.4898363Z * [new branch] gh/dzmitry-huba/14/orig -> origin/gh/dzmitry-huba/14/orig 2025-12-04T12:46:35.4898441Z * [new branch] gh/dzmitry-huba/15/base -> origin/gh/dzmitry-huba/15/base 
2025-12-04T12:46:35.4898516Z * [new branch] gh/dzmitry-huba/15/head -> origin/gh/dzmitry-huba/15/head 2025-12-04T12:46:35.4898592Z * [new branch] gh/dzmitry-huba/15/orig -> origin/gh/dzmitry-huba/15/orig 2025-12-04T12:46:35.4898670Z * [new branch] gh/dzmitry-huba/16/base -> origin/gh/dzmitry-huba/16/base 2025-12-04T12:46:35.4898745Z * [new branch] gh/dzmitry-huba/16/head -> origin/gh/dzmitry-huba/16/head 2025-12-04T12:46:35.4898819Z * [new branch] gh/dzmitry-huba/16/orig -> origin/gh/dzmitry-huba/16/orig 2025-12-04T12:46:35.4898896Z * [new branch] gh/dzmitry-huba/17/base -> origin/gh/dzmitry-huba/17/base 2025-12-04T12:46:35.4898973Z * [new branch] gh/dzmitry-huba/17/head -> origin/gh/dzmitry-huba/17/head 2025-12-04T12:46:35.4899051Z * [new branch] gh/dzmitry-huba/17/orig -> origin/gh/dzmitry-huba/17/orig 2025-12-04T12:46:35.4899131Z * [new branch] gh/dzmitry-huba/2/base -> origin/gh/dzmitry-huba/2/base 2025-12-04T12:46:35.4899206Z * [new branch] gh/dzmitry-huba/2/head -> origin/gh/dzmitry-huba/2/head 2025-12-04T12:46:35.4899284Z * [new branch] gh/dzmitry-huba/3/base -> origin/gh/dzmitry-huba/3/base 2025-12-04T12:46:35.4899363Z * [new branch] gh/dzmitry-huba/3/head -> origin/gh/dzmitry-huba/3/head 2025-12-04T12:46:35.4899438Z * [new branch] gh/eellison/808/base -> origin/gh/eellison/808/base 2025-12-04T12:46:35.4899514Z * [new branch] gh/eellison/808/head -> origin/gh/eellison/808/head 2025-12-04T12:46:35.4899586Z * [new branch] gh/eellison/808/orig -> origin/gh/eellison/808/orig 2025-12-04T12:46:35.4899657Z * [new branch] gh/eellison/822/base -> origin/gh/eellison/822/base 2025-12-04T12:46:35.4899775Z * [new branch] gh/eellison/822/head -> origin/gh/eellison/822/head 2025-12-04T12:46:35.4899846Z * [new branch] gh/eellison/822/orig -> origin/gh/eellison/822/orig 2025-12-04T12:46:35.4899916Z * [new branch] gh/eellison/823/base -> origin/gh/eellison/823/base 2025-12-04T12:46:35.4899988Z * [new branch] gh/eellison/823/head -> origin/gh/eellison/823/head 
2025-12-04T12:46:35.4900099Z * [new branch] gh/eellison/823/orig -> origin/gh/eellison/823/orig 2025-12-04T12:46:35.4900169Z * [new branch] gh/eellison/862/base -> origin/gh/eellison/862/base 2025-12-04T12:46:35.4900242Z * [new branch] gh/eellison/862/head -> origin/gh/eellison/862/head 2025-12-04T12:46:35.4900313Z * [new branch] gh/eellison/862/orig -> origin/gh/eellison/862/orig 2025-12-04T12:46:35.4900383Z * [new branch] gh/eellison/863/base -> origin/gh/eellison/863/base 2025-12-04T12:46:35.4900457Z * [new branch] gh/eellison/863/head -> origin/gh/eellison/863/head 2025-12-04T12:46:35.4900527Z * [new branch] gh/eellison/863/orig -> origin/gh/eellison/863/orig 2025-12-04T12:46:35.4900597Z * [new branch] gh/eellison/864/base -> origin/gh/eellison/864/base 2025-12-04T12:46:35.4900671Z * [new branch] gh/eellison/864/head -> origin/gh/eellison/864/head 2025-12-04T12:46:35.4900741Z * [new branch] gh/eellison/864/orig -> origin/gh/eellison/864/orig 2025-12-04T12:46:35.4900811Z * [new branch] gh/eellison/865/base -> origin/gh/eellison/865/base 2025-12-04T12:46:35.4900885Z * [new branch] gh/eellison/865/head -> origin/gh/eellison/865/head 2025-12-04T12:46:35.4900955Z * [new branch] gh/eellison/865/orig -> origin/gh/eellison/865/orig 2025-12-04T12:46:35.4901028Z * [new branch] gh/eellison/866/base -> origin/gh/eellison/866/base 2025-12-04T12:46:35.4901100Z * [new branch] gh/eellison/866/head -> origin/gh/eellison/866/head 2025-12-04T12:46:35.4901170Z * [new branch] gh/eellison/866/orig -> origin/gh/eellison/866/orig 2025-12-04T12:46:35.4901243Z * [new branch] gh/eellison/867/base -> origin/gh/eellison/867/base 2025-12-04T12:46:35.4901315Z * [new branch] gh/eellison/867/head -> origin/gh/eellison/867/head 2025-12-04T12:46:35.4901385Z * [new branch] gh/eellison/867/orig -> origin/gh/eellison/867/orig 2025-12-04T12:46:35.4901458Z * [new branch] gh/eellison/868/base -> origin/gh/eellison/868/base 2025-12-04T12:46:35.4901528Z * [new branch] gh/eellison/868/head -> 
origin/gh/eellison/868/head 2025-12-04T12:46:35.4901598Z * [new branch] gh/eellison/868/orig -> origin/gh/eellison/868/orig 2025-12-04T12:46:35.4901671Z * [new branch] gh/eellison/869/base -> origin/gh/eellison/869/base 2025-12-04T12:46:35.4901741Z * [new branch] gh/eellison/869/head -> origin/gh/eellison/869/head 2025-12-04T12:46:35.4901811Z * [new branch] gh/eellison/869/orig -> origin/gh/eellison/869/orig 2025-12-04T12:46:35.4901885Z * [new branch] gh/eellison/870/base -> origin/gh/eellison/870/base 2025-12-04T12:46:35.4901956Z * [new branch] gh/eellison/870/head -> origin/gh/eellison/870/head 2025-12-04T12:46:35.4902026Z * [new branch] gh/eellison/870/orig -> origin/gh/eellison/870/orig 2025-12-04T12:46:35.4902098Z * [new branch] gh/eellison/871/base -> origin/gh/eellison/871/base 2025-12-04T12:46:35.4902168Z * [new branch] gh/eellison/871/head -> origin/gh/eellison/871/head 2025-12-04T12:46:35.4902238Z * [new branch] gh/eellison/871/orig -> origin/gh/eellison/871/orig 2025-12-04T12:46:35.4902311Z * [new branch] gh/eellison/872/base -> origin/gh/eellison/872/base 2025-12-04T12:46:35.4902411Z * [new branch] gh/eellison/872/head -> origin/gh/eellison/872/head 2025-12-04T12:46:35.4902485Z * [new branch] gh/eellison/872/orig -> origin/gh/eellison/872/orig 2025-12-04T12:46:35.4902555Z * [new branch] gh/eellison/873/base -> origin/gh/eellison/873/base 2025-12-04T12:46:35.4902649Z * [new branch] gh/eellison/873/head -> origin/gh/eellison/873/head 2025-12-04T12:46:35.4902722Z * [new branch] gh/eellison/873/orig -> origin/gh/eellison/873/orig 2025-12-04T12:46:35.4902791Z * [new branch] gh/eellison/874/base -> origin/gh/eellison/874/base 2025-12-04T12:46:35.4902861Z * [new branch] gh/eellison/874/head -> origin/gh/eellison/874/head 2025-12-04T12:46:35.4902933Z * [new branch] gh/eellison/874/orig -> origin/gh/eellison/874/orig 2025-12-04T12:46:35.4903003Z * [new branch] gh/eellison/875/base -> origin/gh/eellison/875/base 2025-12-04T12:46:35.4903075Z * [new branch] 
gh/eellison/875/head -> origin/gh/eellison/875/head 2025-12-04T12:46:35.4903148Z * [new branch] gh/eellison/875/orig -> origin/gh/eellison/875/orig 2025-12-04T12:46:35.4903219Z * [new branch] gh/eellison/876/base -> origin/gh/eellison/876/base 2025-12-04T12:46:35.4903292Z * [new branch] gh/eellison/876/head -> origin/gh/eellison/876/head 2025-12-04T12:46:35.4903369Z * [new branch] gh/eellison/876/orig -> origin/gh/eellison/876/orig 2025-12-04T12:46:35.4903440Z * [new branch] gh/eellison/877/base -> origin/gh/eellison/877/base 2025-12-04T12:46:35.4903512Z * [new branch] gh/eellison/877/head -> origin/gh/eellison/877/head 2025-12-04T12:46:35.4903589Z * [new branch] gh/eellison/877/orig -> origin/gh/eellison/877/orig 2025-12-04T12:46:35.4903661Z * [new branch] gh/eellison/878/base -> origin/gh/eellison/878/base 2025-12-04T12:46:35.4903734Z * [new branch] gh/eellison/878/head -> origin/gh/eellison/878/head 2025-12-04T12:46:35.4903811Z * [new branch] gh/eellison/878/orig -> origin/gh/eellison/878/orig 2025-12-04T12:46:35.4903883Z * [new branch] gh/eellison/879/base -> origin/gh/eellison/879/base 2025-12-04T12:46:35.4903961Z * [new branch] gh/eellison/879/head -> origin/gh/eellison/879/head 2025-12-04T12:46:35.4904033Z * [new branch] gh/eellison/879/orig -> origin/gh/eellison/879/orig 2025-12-04T12:46:35.4904104Z * [new branch] gh/eellison/880/base -> origin/gh/eellison/880/base 2025-12-04T12:46:35.4904180Z * [new branch] gh/eellison/880/head -> origin/gh/eellison/880/head 2025-12-04T12:46:35.4904252Z * [new branch] gh/eellison/880/orig -> origin/gh/eellison/880/orig 2025-12-04T12:46:35.4904325Z * [new branch] gh/eellison/881/base -> origin/gh/eellison/881/base 2025-12-04T12:46:35.4904400Z * [new branch] gh/eellison/881/head -> origin/gh/eellison/881/head 2025-12-04T12:46:35.4904472Z * [new branch] gh/eellison/881/orig -> origin/gh/eellison/881/orig 2025-12-04T12:46:35.4904544Z * [new branch] gh/eellison/882/base -> origin/gh/eellison/882/base 
2025-12-04T12:46:35.4904621Z * [new branch] gh/eellison/882/head -> origin/gh/eellison/882/head 2025-12-04T12:46:35.4904692Z * [new branch] gh/eellison/882/orig -> origin/gh/eellison/882/orig 2025-12-04T12:46:35.4904763Z * [new branch] gh/eellison/883/base -> origin/gh/eellison/883/base 2025-12-04T12:46:35.4904839Z * [new branch] gh/eellison/883/head -> origin/gh/eellison/883/head 2025-12-04T12:46:35.4904911Z * [new branch] gh/eellison/883/orig -> origin/gh/eellison/883/orig 2025-12-04T12:46:35.4905005Z * [new branch] gh/eellison/884/base -> origin/gh/eellison/884/base 2025-12-04T12:46:35.4905083Z * [new branch] gh/eellison/884/head -> origin/gh/eellison/884/head 2025-12-04T12:46:35.4905153Z * [new branch] gh/eellison/884/orig -> origin/gh/eellison/884/orig 2025-12-04T12:46:35.4905220Z * [new branch] gh/etaf/147/base -> origin/gh/etaf/147/base 2025-12-04T12:46:35.4905314Z * [new branch] gh/etaf/147/head -> origin/gh/etaf/147/head 2025-12-04T12:46:35.4905380Z * [new branch] gh/etaf/154/base -> origin/gh/etaf/154/base 2025-12-04T12:46:35.4905447Z * [new branch] gh/etaf/154/head -> origin/gh/etaf/154/head 2025-12-04T12:46:35.4905515Z * [new branch] gh/etaf/154/orig -> origin/gh/etaf/154/orig 2025-12-04T12:46:35.4905579Z * [new branch] gh/etaf/156/base -> origin/gh/etaf/156/base 2025-12-04T12:46:35.4905647Z * [new branch] gh/etaf/156/head -> origin/gh/etaf/156/head 2025-12-04T12:46:35.4905711Z * [new branch] gh/etaf/156/orig -> origin/gh/etaf/156/orig 2025-12-04T12:46:35.4905775Z * [new branch] gh/etaf/157/base -> origin/gh/etaf/157/base 2025-12-04T12:46:35.4905841Z * [new branch] gh/etaf/157/head -> origin/gh/etaf/157/head 2025-12-04T12:46:35.4905907Z * [new branch] gh/etaf/157/orig -> origin/gh/etaf/157/orig 2025-12-04T12:46:35.4905972Z * [new branch] gh/etaf/158/base -> origin/gh/etaf/158/base 2025-12-04T12:46:35.4906042Z * [new branch] gh/etaf/158/head -> origin/gh/etaf/158/head 2025-12-04T12:46:35.4906106Z * [new branch] gh/etaf/158/orig -> origin/gh/etaf/158/orig 
2025-12-04T12:46:35.4906170Z * [new branch] gh/etaf/159/base -> origin/gh/etaf/159/base 2025-12-04T12:46:35.4906237Z * [new branch] gh/etaf/159/head -> origin/gh/etaf/159/head 2025-12-04T12:46:35.4906302Z * [new branch] gh/etaf/159/orig -> origin/gh/etaf/159/orig 2025-12-04T12:46:35.4906367Z * [new branch] gh/etaf/160/base -> origin/gh/etaf/160/base 2025-12-04T12:46:35.4906434Z * [new branch] gh/etaf/160/head -> origin/gh/etaf/160/head 2025-12-04T12:46:35.4906502Z * [new branch] gh/etaf/160/orig -> origin/gh/etaf/160/orig 2025-12-04T12:46:35.4906568Z * [new branch] gh/etaf/161/base -> origin/gh/etaf/161/base 2025-12-04T12:46:35.4906635Z * [new branch] gh/etaf/161/head -> origin/gh/etaf/161/head 2025-12-04T12:46:35.4906699Z * [new branch] gh/etaf/161/orig -> origin/gh/etaf/161/orig 2025-12-04T12:46:35.4906764Z * [new branch] gh/etaf/166/base -> origin/gh/etaf/166/base 2025-12-04T12:46:35.4906831Z * [new branch] gh/etaf/166/head -> origin/gh/etaf/166/head 2025-12-04T12:46:35.4906897Z * [new branch] gh/etaf/166/orig -> origin/gh/etaf/166/orig 2025-12-04T12:46:35.4906961Z * [new branch] gh/etaf/167/base -> origin/gh/etaf/167/base 2025-12-04T12:46:35.4907029Z * [new branch] gh/etaf/167/head -> origin/gh/etaf/167/head 2025-12-04T12:46:35.4907095Z * [new branch] gh/etaf/167/orig -> origin/gh/etaf/167/orig 2025-12-04T12:46:35.4907163Z * [new branch] gh/etaf/168/base -> origin/gh/etaf/168/base 2025-12-04T12:46:35.4907228Z * [new branch] gh/etaf/168/head -> origin/gh/etaf/168/head 2025-12-04T12:46:35.4907292Z * [new branch] gh/etaf/168/orig -> origin/gh/etaf/168/orig 2025-12-04T12:46:35.4907358Z * [new branch] gh/etaf/172/base -> origin/gh/etaf/172/base 2025-12-04T12:46:35.4907424Z * [new branch] gh/etaf/172/head -> origin/gh/etaf/172/head 2025-12-04T12:46:35.4907569Z * [new branch] gh/etaf/172/orig -> origin/gh/etaf/172/orig 2025-12-04T12:46:35.4907637Z * [new branch] gh/etaf/173/base -> origin/gh/etaf/173/base 2025-12-04T12:46:35.4907701Z * [new branch] gh/etaf/173/head -> 
origin/gh/etaf/173/head 2025-12-04T12:46:35.4907766Z * [new branch] gh/etaf/173/orig -> origin/gh/etaf/173/orig 2025-12-04T12:46:35.4907889Z * [new branch] gh/etaf/174/base -> origin/gh/etaf/174/base 2025-12-04T12:46:35.4907953Z * [new branch] gh/etaf/174/head -> origin/gh/etaf/174/head 2025-12-04T12:46:35.4908017Z * [new branch] gh/etaf/175/base -> origin/gh/etaf/175/base 2025-12-04T12:46:35.4908082Z * [new branch] gh/etaf/175/head -> origin/gh/etaf/175/head 2025-12-04T12:46:35.4908147Z * [new branch] gh/etaf/175/orig -> origin/gh/etaf/175/orig 2025-12-04T12:46:35.4908212Z * [new branch] gh/etaf/176/base -> origin/gh/etaf/176/base 2025-12-04T12:46:35.4908282Z * [new branch] gh/etaf/176/head -> origin/gh/etaf/176/head 2025-12-04T12:46:35.4908346Z * [new branch] gh/etaf/176/orig -> origin/gh/etaf/176/orig 2025-12-04T12:46:35.4908409Z * [new branch] gh/etaf/177/base -> origin/gh/etaf/177/base 2025-12-04T12:46:35.4908476Z * [new branch] gh/etaf/177/head -> origin/gh/etaf/177/head 2025-12-04T12:46:35.4908542Z * [new branch] gh/etaf/177/orig -> origin/gh/etaf/177/orig 2025-12-04T12:46:35.4908606Z * [new branch] gh/etaf/178/base -> origin/gh/etaf/178/base 2025-12-04T12:46:35.4908672Z * [new branch] gh/etaf/178/head -> origin/gh/etaf/178/head 2025-12-04T12:46:35.4908735Z * [new branch] gh/etaf/178/orig -> origin/gh/etaf/178/orig 2025-12-04T12:46:35.4908802Z * [new branch] gh/etaf/179/base -> origin/gh/etaf/179/base 2025-12-04T12:46:35.4908868Z * [new branch] gh/etaf/179/head -> origin/gh/etaf/179/head 2025-12-04T12:46:35.4908933Z * [new branch] gh/etaf/179/orig -> origin/gh/etaf/179/orig 2025-12-04T12:46:35.4908998Z * [new branch] gh/etaf/180/base -> origin/gh/etaf/180/base 2025-12-04T12:46:35.4909064Z * [new branch] gh/etaf/180/head -> origin/gh/etaf/180/head 2025-12-04T12:46:35.4909127Z * [new branch] gh/etaf/180/orig -> origin/gh/etaf/180/orig 2025-12-04T12:46:35.4909208Z * [new branch] gh/exclamaforte/1/base -> origin/gh/exclamaforte/1/base 
2025-12-04T12:46:35.4909286Z * [new branch] gh/exclamaforte/1/head -> origin/gh/exclamaforte/1/head 2025-12-04T12:46:35.4909362Z * [new branch] gh/exclamaforte/2/base -> origin/gh/exclamaforte/2/base 2025-12-04T12:46:35.4909441Z * [new branch] gh/exclamaforte/2/head -> origin/gh/exclamaforte/2/head 2025-12-04T12:46:35.4909516Z * [new branch] gh/exclamaforte/3/base -> origin/gh/exclamaforte/3/base 2025-12-04T12:46:35.4909591Z * [new branch] gh/exclamaforte/3/head -> origin/gh/exclamaforte/3/head 2025-12-04T12:46:35.4909670Z * [new branch] gh/exclamaforte/4/base -> origin/gh/exclamaforte/4/base 2025-12-04T12:46:35.4909746Z * [new branch] gh/exclamaforte/4/head -> origin/gh/exclamaforte/4/head 2025-12-04T12:46:35.4909816Z * [new branch] gh/ezyang/2374/base -> origin/gh/ezyang/2374/base 2025-12-04T12:46:35.4909887Z * [new branch] gh/ezyang/2374/head -> origin/gh/ezyang/2374/head 2025-12-04T12:46:35.4909955Z * [new branch] gh/ezyang/2374/orig -> origin/gh/ezyang/2374/orig 2025-12-04T12:46:35.4910026Z * [new branch] gh/ezyang/2973/base -> origin/gh/ezyang/2973/base 2025-12-04T12:46:35.4910119Z * [new branch] gh/ezyang/2973/head -> origin/gh/ezyang/2973/head 2025-12-04T12:46:35.4910188Z * [new branch] gh/ezyang/2973/orig -> origin/gh/ezyang/2973/orig 2025-12-04T12:46:35.4910256Z * [new branch] gh/ezyang/2974/base -> origin/gh/ezyang/2974/base 2025-12-04T12:46:35.4910326Z * [new branch] gh/ezyang/2974/head -> origin/gh/ezyang/2974/head 2025-12-04T12:46:35.4910419Z * [new branch] gh/ezyang/2974/orig -> origin/gh/ezyang/2974/orig 2025-12-04T12:46:35.4910490Z * [new branch] gh/ezyang/3131/base -> origin/gh/ezyang/3131/base 2025-12-04T12:46:35.4910557Z * [new branch] gh/ezyang/3131/head -> origin/gh/ezyang/3131/head 2025-12-04T12:46:35.4910624Z * [new branch] gh/ezyang/3131/orig -> origin/gh/ezyang/3131/orig 2025-12-04T12:46:35.4910694Z * [new branch] gh/ezyang/3139/base -> origin/gh/ezyang/3139/base 2025-12-04T12:46:35.4910763Z * [new branch] gh/ezyang/3139/head -> 
origin/gh/ezyang/3139/head 2025-12-04T12:46:35.4910830Z * [new branch] gh/ezyang/3139/orig -> origin/gh/ezyang/3139/orig 2025-12-04T12:46:35.4910902Z * [new branch] gh/ezyang/3140/base -> origin/gh/ezyang/3140/base 2025-12-04T12:46:35.4910969Z * [new branch] gh/ezyang/3140/head -> origin/gh/ezyang/3140/head 2025-12-04T12:46:35.4911038Z * [new branch] gh/ezyang/3140/orig -> origin/gh/ezyang/3140/orig 2025-12-04T12:46:35.4911107Z * [new branch] gh/ezyang/3143/base -> origin/gh/ezyang/3143/base 2025-12-04T12:46:35.4911177Z * [new branch] gh/ezyang/3143/head -> origin/gh/ezyang/3143/head 2025-12-04T12:46:35.4911246Z * [new branch] gh/ezyang/3143/orig -> origin/gh/ezyang/3143/orig 2025-12-04T12:46:35.4911316Z * [new branch] gh/ezyang/3144/base -> origin/gh/ezyang/3144/base 2025-12-04T12:46:35.4911384Z * [new branch] gh/ezyang/3144/head -> origin/gh/ezyang/3144/head 2025-12-04T12:46:35.4911451Z * [new branch] gh/ezyang/3144/orig -> origin/gh/ezyang/3144/orig 2025-12-04T12:46:35.4911520Z * [new branch] gh/ezyang/3167/base -> origin/gh/ezyang/3167/base 2025-12-04T12:46:35.4911588Z * [new branch] gh/ezyang/3167/head -> origin/gh/ezyang/3167/head 2025-12-04T12:46:35.4911657Z * [new branch] gh/ezyang/3167/orig -> origin/gh/ezyang/3167/orig 2025-12-04T12:46:35.4911728Z * [new branch] gh/ezyang/3173/base -> origin/gh/ezyang/3173/base 2025-12-04T12:46:35.4911797Z * [new branch] gh/ezyang/3173/head -> origin/gh/ezyang/3173/head 2025-12-04T12:46:35.4911865Z * [new branch] gh/ezyang/3173/orig -> origin/gh/ezyang/3173/orig 2025-12-04T12:46:35.4911934Z * [new branch] gh/ezyang/3175/base -> origin/gh/ezyang/3175/base 2025-12-04T12:46:35.4912002Z * [new branch] gh/ezyang/3175/head -> origin/gh/ezyang/3175/head 2025-12-04T12:46:35.4912071Z * [new branch] gh/ezyang/3175/orig -> origin/gh/ezyang/3175/orig 2025-12-04T12:46:35.4912139Z * [new branch] gh/ezyang/3182/base -> origin/gh/ezyang/3182/base 2025-12-04T12:46:35.4912208Z * [new branch] gh/ezyang/3182/head -> 
origin/gh/ezyang/3182/head 2025-12-04T12:46:35.4912281Z * [new branch] gh/ezyang/3182/orig -> origin/gh/ezyang/3182/orig 2025-12-04T12:46:35.4923313Z * [new branch] gh/ezyang/3185/base -> origin/gh/ezyang/3185/base 2025-12-04T12:46:35.4923388Z * [new branch] gh/ezyang/3185/head -> origin/gh/ezyang/3185/head 2025-12-04T12:46:35.4923468Z * [new branch] gh/ezyang/3185/orig -> origin/gh/ezyang/3185/orig 2025-12-04T12:46:35.4923538Z * [new branch] gh/ezyang/3189/base -> origin/gh/ezyang/3189/base 2025-12-04T12:46:35.4923671Z * [new branch] gh/ezyang/3189/head -> origin/gh/ezyang/3189/head 2025-12-04T12:46:35.4923747Z * [new branch] gh/ezyang/3189/orig -> origin/gh/ezyang/3189/orig 2025-12-04T12:46:35.4923818Z * [new branch] gh/ezyang/3191/base -> origin/gh/ezyang/3191/base 2025-12-04T12:46:35.4923889Z * [new branch] gh/ezyang/3191/head -> origin/gh/ezyang/3191/head 2025-12-04T12:46:35.4924003Z * [new branch] gh/ezyang/3191/orig -> origin/gh/ezyang/3191/orig 2025-12-04T12:46:35.4924074Z * [new branch] gh/ezyang/3192/base -> origin/gh/ezyang/3192/base 2025-12-04T12:46:35.4924145Z * [new branch] gh/ezyang/3192/head -> origin/gh/ezyang/3192/head 2025-12-04T12:46:35.4924219Z * [new branch] gh/ezyang/3192/orig -> origin/gh/ezyang/3192/orig 2025-12-04T12:46:35.4924293Z * [new branch] gh/ezyang/3193/base -> origin/gh/ezyang/3193/base 2025-12-04T12:46:35.4924367Z * [new branch] gh/ezyang/3193/head -> origin/gh/ezyang/3193/head 2025-12-04T12:46:35.4924442Z * [new branch] gh/ezyang/3193/orig -> origin/gh/ezyang/3193/orig 2025-12-04T12:46:35.4924515Z * [new branch] gh/ezyang/3194/base -> origin/gh/ezyang/3194/base 2025-12-04T12:46:35.4924588Z * [new branch] gh/ezyang/3194/head -> origin/gh/ezyang/3194/head 2025-12-04T12:46:35.4924669Z * [new branch] gh/ezyang/3194/orig -> origin/gh/ezyang/3194/orig 2025-12-04T12:46:35.4924741Z * [new branch] gh/ezyang/3195/base -> origin/gh/ezyang/3195/base 2025-12-04T12:46:35.4924811Z * [new branch] gh/ezyang/3195/head -> 
origin/gh/ezyang/3195/head 2025-12-04T12:46:35.4924883Z * [new branch] gh/ezyang/3195/orig -> origin/gh/ezyang/3195/orig 2025-12-04T12:46:35.4924951Z * [new branch] gh/ezyang/3196/base -> origin/gh/ezyang/3196/base 2025-12-04T12:46:35.4925022Z * [new branch] gh/ezyang/3196/head -> origin/gh/ezyang/3196/head 2025-12-04T12:46:35.4925094Z * [new branch] gh/ezyang/3196/orig -> origin/gh/ezyang/3196/orig 2025-12-04T12:46:35.4925163Z * [new branch] gh/ezyang/3197/base -> origin/gh/ezyang/3197/base 2025-12-04T12:46:35.4925234Z * [new branch] gh/ezyang/3197/head -> origin/gh/ezyang/3197/head 2025-12-04T12:46:35.4925305Z * [new branch] gh/ezyang/3197/orig -> origin/gh/ezyang/3197/orig 2025-12-04T12:46:35.4925375Z * [new branch] gh/ezyang/3198/base -> origin/gh/ezyang/3198/base 2025-12-04T12:46:35.4925445Z * [new branch] gh/ezyang/3198/head -> origin/gh/ezyang/3198/head 2025-12-04T12:46:35.4925513Z * [new branch] gh/ezyang/3198/orig -> origin/gh/ezyang/3198/orig 2025-12-04T12:46:35.4925580Z * [new branch] gh/ezyang/3199/base -> origin/gh/ezyang/3199/base 2025-12-04T12:46:35.4925652Z * [new branch] gh/ezyang/3199/head -> origin/gh/ezyang/3199/head 2025-12-04T12:46:35.4925719Z * [new branch] gh/ezyang/3199/orig -> origin/gh/ezyang/3199/orig 2025-12-04T12:46:35.4925788Z * [new branch] gh/ezyang/3200/base -> origin/gh/ezyang/3200/base 2025-12-04T12:46:35.4925859Z * [new branch] gh/ezyang/3200/head -> origin/gh/ezyang/3200/head 2025-12-04T12:46:35.4925928Z * [new branch] gh/ezyang/3200/orig -> origin/gh/ezyang/3200/orig 2025-12-04T12:46:35.4925998Z * [new branch] gh/ezyang/3201/base -> origin/gh/ezyang/3201/base 2025-12-04T12:46:35.4926068Z * [new branch] gh/ezyang/3201/head -> origin/gh/ezyang/3201/head 2025-12-04T12:46:35.4926138Z * [new branch] gh/ezyang/3201/orig -> origin/gh/ezyang/3201/orig 2025-12-04T12:46:35.4926206Z * [new branch] gh/ezyang/3202/base -> origin/gh/ezyang/3202/base 2025-12-04T12:46:35.4926301Z * [new branch] gh/ezyang/3202/head -> 
origin/gh/ezyang/3202/head 2025-12-04T12:46:35.4926371Z * [new branch] gh/ezyang/3202/orig -> origin/gh/ezyang/3202/orig 2025-12-04T12:46:35.4926439Z * [new branch] gh/ezyang/3203/base -> origin/gh/ezyang/3203/base 2025-12-04T12:46:35.4926508Z * [new branch] gh/ezyang/3203/head -> origin/gh/ezyang/3203/head 2025-12-04T12:46:35.4926600Z * [new branch] gh/ezyang/3203/orig -> origin/gh/ezyang/3203/orig 2025-12-04T12:46:35.4926668Z * [new branch] gh/ezyang/3204/base -> origin/gh/ezyang/3204/base 2025-12-04T12:46:35.4926740Z * [new branch] gh/ezyang/3204/head -> origin/gh/ezyang/3204/head 2025-12-04T12:46:35.4926808Z * [new branch] gh/ezyang/3204/orig -> origin/gh/ezyang/3204/orig 2025-12-04T12:46:35.4926876Z * [new branch] gh/ezyang/3205/base -> origin/gh/ezyang/3205/base 2025-12-04T12:46:35.4926945Z * [new branch] gh/ezyang/3205/head -> origin/gh/ezyang/3205/head 2025-12-04T12:46:35.4927014Z * [new branch] gh/ezyang/3205/orig -> origin/gh/ezyang/3205/orig 2025-12-04T12:46:35.4927082Z * [new branch] gh/ezyang/3206/base -> origin/gh/ezyang/3206/base 2025-12-04T12:46:35.4927149Z * [new branch] gh/ezyang/3206/head -> origin/gh/ezyang/3206/head 2025-12-04T12:46:35.4927218Z * [new branch] gh/ezyang/3206/orig -> origin/gh/ezyang/3206/orig 2025-12-04T12:46:35.4927286Z * [new branch] gh/ezyang/3207/base -> origin/gh/ezyang/3207/base 2025-12-04T12:46:35.4927354Z * [new branch] gh/ezyang/3207/head -> origin/gh/ezyang/3207/head 2025-12-04T12:46:35.4927421Z * [new branch] gh/ezyang/3207/orig -> origin/gh/ezyang/3207/orig 2025-12-04T12:46:35.4927535Z * [new branch] gh/ezyang/3208/base -> origin/gh/ezyang/3208/base 2025-12-04T12:46:35.4927606Z * [new branch] gh/ezyang/3208/head -> origin/gh/ezyang/3208/head 2025-12-04T12:46:35.4927675Z * [new branch] gh/ezyang/3208/orig -> origin/gh/ezyang/3208/orig 2025-12-04T12:46:35.4927743Z * [new branch] gh/ezyang/3209/base -> origin/gh/ezyang/3209/base 2025-12-04T12:46:35.4927812Z * [new branch] gh/ezyang/3209/head -> 
origin/gh/ezyang/3209/head 2025-12-04T12:46:35.4927882Z * [new branch] gh/ezyang/3209/orig -> origin/gh/ezyang/3209/orig 2025-12-04T12:46:35.4927958Z * [new branch] gh/fadara01/3/base -> origin/gh/fadara01/3/base 2025-12-04T12:46:35.4928028Z * [new branch] gh/fadara01/3/head -> origin/gh/fadara01/3/head 2025-12-04T12:46:35.4928097Z * [new branch] gh/fadara01/3/orig -> origin/gh/fadara01/3/orig 2025-12-04T12:46:35.4928167Z * [new branch] gh/fadara01/5/base -> origin/gh/fadara01/5/base 2025-12-04T12:46:35.4928237Z * [new branch] gh/fadara01/5/head -> origin/gh/fadara01/5/head 2025-12-04T12:46:35.4928307Z * [new branch] gh/fadara01/5/orig -> origin/gh/fadara01/5/orig 2025-12-04T12:46:35.4928378Z * [new branch] gh/fadara01/6/base -> origin/gh/fadara01/6/base 2025-12-04T12:46:35.4928446Z * [new branch] gh/fadara01/6/head -> origin/gh/fadara01/6/head 2025-12-04T12:46:35.4928516Z * [new branch] gh/fadara01/6/orig -> origin/gh/fadara01/6/orig 2025-12-04T12:46:35.4928584Z * [new branch] gh/fadara01/7/base -> origin/gh/fadara01/7/base 2025-12-04T12:46:35.4928651Z * [new branch] gh/fadara01/7/head -> origin/gh/fadara01/7/head 2025-12-04T12:46:35.4928720Z * [new branch] gh/fadara01/7/orig -> origin/gh/fadara01/7/orig 2025-12-04T12:46:35.4928788Z * [new branch] gh/fadara01/8/base -> origin/gh/fadara01/8/base 2025-12-04T12:46:35.4928902Z * [new branch] gh/fadara01/8/head -> origin/gh/fadara01/8/head 2025-12-04T12:46:35.4928976Z * [new branch] gh/fadara01/8/orig -> origin/gh/fadara01/8/orig 2025-12-04T12:46:35.4929046Z * [new branch] gh/fadara01/9/base -> origin/gh/fadara01/9/base 2025-12-04T12:46:35.4929117Z * [new branch] gh/fadara01/9/head -> origin/gh/fadara01/9/head 2025-12-04T12:46:35.4929261Z * [new branch] gh/fadara01/9/orig -> origin/gh/fadara01/9/orig 2025-12-04T12:46:35.4929331Z * [new branch] gh/fduwjj/182/base -> origin/gh/fduwjj/182/base 2025-12-04T12:46:35.4929403Z * [new branch] gh/fduwjj/182/head -> origin/gh/fduwjj/182/head 2025-12-04T12:46:35.4929479Z * [new 
branch] gh/fduwjj/182/orig -> origin/gh/fduwjj/182/orig 2025-12-04T12:46:35.4929549Z * [new branch] gh/fduwjj/211/base -> origin/gh/fduwjj/211/base 2025-12-04T12:46:35.4929620Z * [new branch] gh/fduwjj/211/head -> origin/gh/fduwjj/211/head 2025-12-04T12:46:35.4929692Z * [new branch] gh/fduwjj/211/orig -> origin/gh/fduwjj/211/orig 2025-12-04T12:46:35.4929763Z * [new branch] gh/fduwjj/212/base -> origin/gh/fduwjj/212/base 2025-12-04T12:46:35.4929830Z * [new branch] gh/fduwjj/212/head -> origin/gh/fduwjj/212/head 2025-12-04T12:46:35.4929904Z * [new branch] gh/fduwjj/212/orig -> origin/gh/fduwjj/212/orig 2025-12-04T12:46:35.4929974Z * [new branch] gh/fduwjj/213/base -> origin/gh/fduwjj/213/base 2025-12-04T12:46:35.4930044Z * [new branch] gh/fduwjj/213/head -> origin/gh/fduwjj/213/head 2025-12-04T12:46:35.4930115Z * [new branch] gh/fduwjj/213/orig -> origin/gh/fduwjj/213/orig 2025-12-04T12:46:35.4930183Z * [new branch] gh/fduwjj/226/base -> origin/gh/fduwjj/226/base 2025-12-04T12:46:35.4930252Z * [new branch] gh/fduwjj/226/head -> origin/gh/fduwjj/226/head 2025-12-04T12:46:35.4930322Z * [new branch] gh/fduwjj/226/orig -> origin/gh/fduwjj/226/orig 2025-12-04T12:46:35.4930390Z * [new branch] gh/fduwjj/229/base -> origin/gh/fduwjj/229/base 2025-12-04T12:46:35.4930465Z * [new branch] gh/fduwjj/229/head -> origin/gh/fduwjj/229/head 2025-12-04T12:46:35.4930536Z * [new branch] gh/fduwjj/229/orig -> origin/gh/fduwjj/229/orig 2025-12-04T12:46:35.4930606Z * [new branch] gh/fduwjj/233/base -> origin/gh/fduwjj/233/base 2025-12-04T12:46:35.4930678Z * [new branch] gh/fduwjj/233/head -> origin/gh/fduwjj/233/head 2025-12-04T12:46:35.4930746Z * [new branch] gh/fduwjj/233/orig -> origin/gh/fduwjj/233/orig 2025-12-04T12:46:35.4930813Z * [new branch] gh/fduwjj/234/base -> origin/gh/fduwjj/234/base 2025-12-04T12:46:35.4930887Z * [new branch] gh/fduwjj/234/head -> origin/gh/fduwjj/234/head 2025-12-04T12:46:35.4930954Z * [new branch] gh/fduwjj/234/orig -> origin/gh/fduwjj/234/orig 
2025-12-04T12:46:35.4931021Z * [new branch] gh/fduwjj/235/base -> origin/gh/fduwjj/235/base 2025-12-04T12:46:35.4931090Z * [new branch] gh/fduwjj/235/head -> origin/gh/fduwjj/235/head 2025-12-04T12:46:35.4931160Z * [new branch] gh/fduwjj/235/orig -> origin/gh/fduwjj/235/orig 2025-12-04T12:46:35.4931228Z * [new branch] gh/fduwjj/236/base -> origin/gh/fduwjj/236/base 2025-12-04T12:46:35.4931302Z * [new branch] gh/fduwjj/236/head -> origin/gh/fduwjj/236/head 2025-12-04T12:46:35.4931371Z * [new branch] gh/fduwjj/236/orig -> origin/gh/fduwjj/236/orig 2025-12-04T12:46:35.4931440Z * [new branch] gh/fduwjj/237/base -> origin/gh/fduwjj/237/base 2025-12-04T12:46:35.4931510Z * [new branch] gh/fduwjj/237/head -> origin/gh/fduwjj/237/head 2025-12-04T12:46:35.4931605Z * [new branch] gh/fduwjj/237/orig -> origin/gh/fduwjj/237/orig 2025-12-04T12:46:35.4931673Z * [new branch] gh/fduwjj/238/base -> origin/gh/fduwjj/238/base 2025-12-04T12:46:35.4931741Z * [new branch] gh/fduwjj/238/head -> origin/gh/fduwjj/238/head 2025-12-04T12:46:35.4931835Z * [new branch] gh/fduwjj/238/orig -> origin/gh/fduwjj/238/orig 2025-12-04T12:46:35.4931903Z * [new branch] gh/fduwjj/239/base -> origin/gh/fduwjj/239/base 2025-12-04T12:46:35.4931971Z * [new branch] gh/fduwjj/239/head -> origin/gh/fduwjj/239/head 2025-12-04T12:46:35.4932039Z * [new branch] gh/fduwjj/239/orig -> origin/gh/fduwjj/239/orig 2025-12-04T12:46:35.4932110Z * [new branch] gh/fegin/332/base -> origin/gh/fegin/332/base 2025-12-04T12:46:35.4932180Z * [new branch] gh/fegin/332/head -> origin/gh/fegin/332/head 2025-12-04T12:46:35.4932249Z * [new branch] gh/fegin/332/orig -> origin/gh/fegin/332/orig 2025-12-04T12:46:35.4932322Z * [new branch] gh/fegin/333/base -> origin/gh/fegin/333/base 2025-12-04T12:46:35.4932389Z * [new branch] gh/fegin/333/head -> origin/gh/fegin/333/head 2025-12-04T12:46:35.4932459Z * [new branch] gh/fegin/333/orig -> origin/gh/fegin/333/orig 2025-12-04T12:46:35.4932526Z * [new branch] gh/fegin/334/base -> 
origin/gh/fegin/334/base 2025-12-04T12:46:35.4932593Z * [new branch] gh/fegin/334/head -> origin/gh/fegin/334/head 2025-12-04T12:46:35.4932661Z * [new branch] gh/fegin/334/orig -> origin/gh/fegin/334/orig 2025-12-04T12:46:35.4932730Z * [new branch] gh/fegin/335/base -> origin/gh/fegin/335/base 2025-12-04T12:46:35.4932796Z * [new branch] gh/fegin/335/head -> origin/gh/fegin/335/head 2025-12-04T12:46:35.4932863Z * [new branch] gh/fegin/335/orig -> origin/gh/fegin/335/orig 2025-12-04T12:46:35.4932933Z * [new branch] gh/fffrog/160/base -> origin/gh/fffrog/160/base 2025-12-04T12:46:35.4933003Z * [new branch] gh/fffrog/160/head -> origin/gh/fffrog/160/head 2025-12-04T12:46:35.4933072Z * [new branch] gh/fffrog/177/base -> origin/gh/fffrog/177/base 2025-12-04T12:46:35.4933140Z * [new branch] gh/fffrog/177/head -> origin/gh/fffrog/177/head 2025-12-04T12:46:35.4933208Z * [new branch] gh/fffrog/177/orig -> origin/gh/fffrog/177/orig 2025-12-04T12:46:35.4933274Z * [new branch] gh/fffrog/178/base -> origin/gh/fffrog/178/base 2025-12-04T12:46:35.4933344Z * [new branch] gh/fffrog/178/head -> origin/gh/fffrog/178/head 2025-12-04T12:46:35.4933410Z * [new branch] gh/fffrog/178/orig -> origin/gh/fffrog/178/orig 2025-12-04T12:46:35.4933478Z * [new branch] gh/fffrog/181/base -> origin/gh/fffrog/181/base 2025-12-04T12:46:35.4933547Z * [new branch] gh/fffrog/181/head -> origin/gh/fffrog/181/head 2025-12-04T12:46:35.4933615Z * [new branch] gh/fffrog/181/orig -> origin/gh/fffrog/181/orig 2025-12-04T12:46:35.4933684Z * [new branch] gh/fffrog/183/base -> origin/gh/fffrog/183/base 2025-12-04T12:46:35.4933755Z * [new branch] gh/fffrog/183/head -> origin/gh/fffrog/183/head 2025-12-04T12:46:35.4933822Z * [new branch] gh/fffrog/183/orig -> origin/gh/fffrog/183/orig 2025-12-04T12:46:35.4933892Z * [new branch] gh/fxdawnn/10/base -> origin/gh/fxdawnn/10/base 2025-12-04T12:46:35.4933959Z * [new branch] gh/fxdawnn/10/head -> origin/gh/fxdawnn/10/head 2025-12-04T12:46:35.4934027Z * [new branch] 
gh/fxdawnn/10/orig -> origin/gh/fxdawnn/10/orig 2025-12-04T12:46:35.4934120Z * [new branch] gh/fxdawnn/11/base -> origin/gh/fxdawnn/11/base 2025-12-04T12:46:35.4934188Z * [new branch] gh/fxdawnn/11/head -> origin/gh/fxdawnn/11/head 2025-12-04T12:46:35.4934255Z * [new branch] gh/fxdawnn/11/orig -> origin/gh/fxdawnn/11/orig 2025-12-04T12:46:35.4934350Z * [new branch] gh/fxdawnn/12/base -> origin/gh/fxdawnn/12/base 2025-12-04T12:46:35.4934419Z * [new branch] gh/fxdawnn/12/head -> origin/gh/fxdawnn/12/head 2025-12-04T12:46:35.4934486Z * [new branch] gh/fxdawnn/12/orig -> origin/gh/fxdawnn/12/orig 2025-12-04T12:46:35.4934557Z * [new branch] gh/fxdawnn/13/base -> origin/gh/fxdawnn/13/base 2025-12-04T12:46:35.4934624Z * [new branch] gh/fxdawnn/13/head -> origin/gh/fxdawnn/13/head 2025-12-04T12:46:35.4934691Z * [new branch] gh/fxdawnn/13/orig -> origin/gh/fxdawnn/13/orig 2025-12-04T12:46:35.4934761Z * [new branch] gh/fxdawnn/14/base -> origin/gh/fxdawnn/14/base 2025-12-04T12:46:35.4934828Z * [new branch] gh/fxdawnn/14/head -> origin/gh/fxdawnn/14/head 2025-12-04T12:46:35.4934895Z * [new branch] gh/fxdawnn/14/orig -> origin/gh/fxdawnn/14/orig 2025-12-04T12:46:35.4934964Z * [new branch] gh/fxdawnn/15/base -> origin/gh/fxdawnn/15/base 2025-12-04T12:46:35.4935034Z * [new branch] gh/fxdawnn/15/head -> origin/gh/fxdawnn/15/head 2025-12-04T12:46:35.4935102Z * [new branch] gh/fxdawnn/15/orig -> origin/gh/fxdawnn/15/orig 2025-12-04T12:46:35.4935172Z * [new branch] gh/fxdawnn/6/base -> origin/gh/fxdawnn/6/base 2025-12-04T12:46:35.4935240Z * [new branch] gh/fxdawnn/6/head -> origin/gh/fxdawnn/6/head 2025-12-04T12:46:35.4935307Z * [new branch] gh/fxdawnn/6/orig -> origin/gh/fxdawnn/6/orig 2025-12-04T12:46:35.4935378Z * [new branch] gh/fxdawnn/7/base -> origin/gh/fxdawnn/7/base 2025-12-04T12:46:35.4935445Z * [new branch] gh/fxdawnn/7/head -> origin/gh/fxdawnn/7/head 2025-12-04T12:46:35.4935515Z * [new branch] gh/fxdawnn/7/orig -> origin/gh/fxdawnn/7/orig 2025-12-04T12:46:35.4935582Z * 
[new branch] gh/fxdawnn/9/base -> origin/gh/fxdawnn/9/base 2025-12-04T12:46:35.4935652Z * [new branch] gh/fxdawnn/9/head -> origin/gh/fxdawnn/9/head 2025-12-04T12:46:35.4935719Z * [new branch] gh/fxdawnn/9/orig -> origin/gh/fxdawnn/9/orig 2025-12-04T12:46:35.4935786Z * [new branch] gh/galv/1/base -> origin/gh/galv/1/base 2025-12-04T12:46:35.4935851Z * [new branch] gh/galv/1/head -> origin/gh/galv/1/head 2025-12-04T12:46:35.4935917Z * [new branch] gh/galv/1/orig -> origin/gh/galv/1/orig 2025-12-04T12:46:35.4935980Z * [new branch] gh/galv/2/base -> origin/gh/galv/2/base 2025-12-04T12:46:35.4936043Z * [new branch] gh/galv/2/head -> origin/gh/galv/2/head 2025-12-04T12:46:35.4936108Z * [new branch] gh/galv/2/orig -> origin/gh/galv/2/orig 2025-12-04T12:46:35.4936171Z * [new branch] gh/galv/3/base -> origin/gh/galv/3/base 2025-12-04T12:46:35.4936236Z * [new branch] gh/galv/3/head -> origin/gh/galv/3/head 2025-12-04T12:46:35.4936299Z * [new branch] gh/galv/3/orig -> origin/gh/galv/3/orig 2025-12-04T12:46:35.4936377Z * [new branch] gh/guangyey/134/base -> origin/gh/guangyey/134/base 2025-12-04T12:46:35.4936452Z * [new branch] gh/guangyey/134/head -> origin/gh/guangyey/134/head 2025-12-04T12:46:35.4936526Z * [new branch] gh/guangyey/134/orig -> origin/gh/guangyey/134/orig 2025-12-04T12:46:35.4936621Z * [new branch] gh/guangyey/163/base -> origin/gh/guangyey/163/base 2025-12-04T12:46:35.4936693Z * [new branch] gh/guangyey/163/head -> origin/gh/guangyey/163/head 2025-12-04T12:46:35.4936766Z * [new branch] gh/guangyey/163/orig -> origin/gh/guangyey/163/orig 2025-12-04T12:46:35.4936836Z * [new branch] gh/guangyey/168/base -> origin/gh/guangyey/168/base 2025-12-04T12:46:35.4936930Z * [new branch] gh/guangyey/168/head -> origin/gh/guangyey/168/head 2025-12-04T12:46:35.4937001Z * [new branch] gh/guangyey/168/orig -> origin/gh/guangyey/168/orig 2025-12-04T12:46:35.4937070Z * [new branch] gh/guangyey/169/base -> origin/gh/guangyey/169/base 2025-12-04T12:46:35.4937141Z * [new branch] 
gh/guangyey/169/head -> origin/gh/guangyey/169/head 2025-12-04T12:46:35.4937211Z * [new branch] gh/guangyey/169/orig -> origin/gh/guangyey/169/orig 2025-12-04T12:46:35.4937283Z * [new branch] gh/guangyey/170/base -> origin/gh/guangyey/170/base 2025-12-04T12:46:35.4937354Z * [new branch] gh/guangyey/170/head -> origin/gh/guangyey/170/head 2025-12-04T12:46:35.4937425Z * [new branch] gh/guangyey/170/orig -> origin/gh/guangyey/170/orig 2025-12-04T12:46:35.4937537Z * [new branch] gh/guangyey/171/base -> origin/gh/guangyey/171/base 2025-12-04T12:46:35.4937612Z * [new branch] gh/guangyey/171/head -> origin/gh/guangyey/171/head 2025-12-04T12:46:35.4937682Z * [new branch] gh/guangyey/171/orig -> origin/gh/guangyey/171/orig 2025-12-04T12:46:35.4937753Z * [new branch] gh/guangyey/178/base -> origin/gh/guangyey/178/base 2025-12-04T12:46:35.4937825Z * [new branch] gh/guangyey/178/head -> origin/gh/guangyey/178/head 2025-12-04T12:46:35.4937895Z * [new branch] gh/guangyey/178/orig -> origin/gh/guangyey/178/orig 2025-12-04T12:46:35.4937965Z * [new branch] gh/guangyey/182/base -> origin/gh/guangyey/182/base 2025-12-04T12:46:35.4938038Z * [new branch] gh/guangyey/182/head -> origin/gh/guangyey/182/head 2025-12-04T12:46:35.4938108Z * [new branch] gh/guangyey/182/orig -> origin/gh/guangyey/182/orig 2025-12-04T12:46:35.4938178Z * [new branch] gh/guangyey/183/base -> origin/gh/guangyey/183/base 2025-12-04T12:46:35.4938252Z * [new branch] gh/guangyey/183/head -> origin/gh/guangyey/183/head 2025-12-04T12:46:35.4938322Z * [new branch] gh/guangyey/183/orig -> origin/gh/guangyey/183/orig 2025-12-04T12:46:35.4938393Z * [new branch] gh/guangyey/185/base -> origin/gh/guangyey/185/base 2025-12-04T12:46:35.4938465Z * [new branch] gh/guangyey/185/head -> origin/gh/guangyey/185/head 2025-12-04T12:46:35.4938536Z * [new branch] gh/guangyey/185/orig -> origin/gh/guangyey/185/orig 2025-12-04T12:46:35.4938608Z * [new branch] gh/guangyey/186/base -> origin/gh/guangyey/186/base 
2025-12-04T12:46:35.4938679Z * [new branch] gh/guangyey/186/head -> origin/gh/guangyey/186/head 2025-12-04T12:46:35.4938752Z * [new branch] gh/guangyey/186/orig -> origin/gh/guangyey/186/orig 2025-12-04T12:46:35.4938822Z * [new branch] gh/guangyey/187/base -> origin/gh/guangyey/187/base 2025-12-04T12:46:35.4938895Z * [new branch] gh/guangyey/187/head -> origin/gh/guangyey/187/head 2025-12-04T12:46:35.4938967Z * [new branch] gh/guangyey/187/orig -> origin/gh/guangyey/187/orig 2025-12-04T12:46:35.4939038Z * [new branch] gh/guangyey/188/base -> origin/gh/guangyey/188/base 2025-12-04T12:46:35.4939108Z * [new branch] gh/guangyey/188/head -> origin/gh/guangyey/188/head 2025-12-04T12:46:35.4939179Z * [new branch] gh/guangyey/188/orig -> origin/gh/guangyey/188/orig 2025-12-04T12:46:35.4939309Z * [new branch] gh/guangyey/190/base -> origin/gh/guangyey/190/base 2025-12-04T12:46:35.4939380Z * [new branch] gh/guangyey/190/head -> origin/gh/guangyey/190/head 2025-12-04T12:46:35.4939451Z * [new branch] gh/guangyey/190/orig -> origin/gh/guangyey/190/orig 2025-12-04T12:46:35.4939525Z * [new branch] gh/guangyey/208/base -> origin/gh/guangyey/208/base 2025-12-04T12:46:35.4939679Z * [new branch] gh/guangyey/208/head -> origin/gh/guangyey/208/head 2025-12-04T12:46:35.4939749Z * [new branch] gh/guangyey/208/orig -> origin/gh/guangyey/208/orig 2025-12-04T12:46:35.4939823Z * [new branch] gh/guangyey/228/base -> origin/gh/guangyey/228/base 2025-12-04T12:46:35.4939892Z * [new branch] gh/guangyey/228/head -> origin/gh/guangyey/228/head 2025-12-04T12:46:35.4939963Z * [new branch] gh/guangyey/228/orig -> origin/gh/guangyey/228/orig 2025-12-04T12:46:35.4940035Z * [new branch] gh/guangyey/230/base -> origin/gh/guangyey/230/base 2025-12-04T12:46:35.4940104Z * [new branch] gh/guangyey/230/head -> origin/gh/guangyey/230/head 2025-12-04T12:46:35.4940175Z * [new branch] gh/guangyey/230/orig -> origin/gh/guangyey/230/orig 2025-12-04T12:46:35.4940246Z * [new branch] gh/guangyey/231/base -> 
origin/gh/guangyey/231/base 2025-12-04T12:46:35.4940317Z * [new branch] gh/guangyey/231/head -> origin/gh/guangyey/231/head 2025-12-04T12:46:35.4940389Z * [new branch] gh/guangyey/231/orig -> origin/gh/guangyey/231/orig 2025-12-04T12:46:35.4940460Z * [new branch] gh/guangyey/232/base -> origin/gh/guangyey/232/base 2025-12-04T12:46:35.4940531Z * [new branch] gh/guangyey/232/head -> origin/gh/guangyey/232/head 2025-12-04T12:46:35.4940602Z * [new branch] gh/guangyey/232/orig -> origin/gh/guangyey/232/orig 2025-12-04T12:46:35.4940674Z * [new branch] gh/guangyey/233/base -> origin/gh/guangyey/233/base 2025-12-04T12:46:35.4940744Z * [new branch] gh/guangyey/233/head -> origin/gh/guangyey/233/head 2025-12-04T12:46:35.4940816Z * [new branch] gh/guangyey/233/orig -> origin/gh/guangyey/233/orig 2025-12-04T12:46:35.4940889Z * [new branch] gh/guangyey/234/base -> origin/gh/guangyey/234/base 2025-12-04T12:46:35.4940962Z * [new branch] gh/guangyey/234/head -> origin/gh/guangyey/234/head 2025-12-04T12:46:35.4941034Z * [new branch] gh/guangyey/234/orig -> origin/gh/guangyey/234/orig 2025-12-04T12:46:35.4941106Z * [new branch] gh/guangyey/235/base -> origin/gh/guangyey/235/base 2025-12-04T12:46:35.4941176Z * [new branch] gh/guangyey/235/head -> origin/gh/guangyey/235/head 2025-12-04T12:46:35.4941248Z * [new branch] gh/guangyey/235/orig -> origin/gh/guangyey/235/orig 2025-12-04T12:46:35.4941319Z * [new branch] gh/guangyey/236/base -> origin/gh/guangyey/236/base 2025-12-04T12:46:35.4941388Z * [new branch] gh/guangyey/236/head -> origin/gh/guangyey/236/head 2025-12-04T12:46:35.4941459Z * [new branch] gh/guangyey/236/orig -> origin/gh/guangyey/236/orig 2025-12-04T12:46:35.4941530Z * [new branch] gh/guangyey/237/base -> origin/gh/guangyey/237/base 2025-12-04T12:46:35.4941601Z * [new branch] gh/guangyey/237/head -> origin/gh/guangyey/237/head 2025-12-04T12:46:35.4941677Z * [new branch] gh/guangyey/237/orig -> origin/gh/guangyey/237/orig 2025-12-04T12:46:35.4941748Z * [new branch] 
gh/guangyey/238/base -> origin/gh/guangyey/238/base 2025-12-04T12:46:35.4941822Z * [new branch] gh/guangyey/238/head -> origin/gh/guangyey/238/head 2025-12-04T12:46:35.4941894Z * [new branch] gh/guangyey/239/base -> origin/gh/guangyey/239/base 2025-12-04T12:46:35.4942003Z * [new branch] gh/guangyey/239/head -> origin/gh/guangyey/239/head 2025-12-04T12:46:35.4942077Z * [new branch] gh/guangyey/239/orig -> origin/gh/guangyey/239/orig 2025-12-04T12:46:35.4942148Z * [new branch] gh/guangyey/240/base -> origin/gh/guangyey/240/base 2025-12-04T12:46:35.4942246Z * [new branch] gh/guangyey/240/head -> origin/gh/guangyey/240/head 2025-12-04T12:46:35.4942318Z * [new branch] gh/guangyey/240/orig -> origin/gh/guangyey/240/orig 2025-12-04T12:46:35.4942388Z * [new branch] gh/guangyey/241/base -> origin/gh/guangyey/241/base 2025-12-04T12:46:35.4942458Z * [new branch] gh/guangyey/241/head -> origin/gh/guangyey/241/head 2025-12-04T12:46:35.4942528Z * [new branch] gh/guangyey/241/orig -> origin/gh/guangyey/241/orig 2025-12-04T12:46:35.4942598Z * [new branch] gh/guangyey/242/base -> origin/gh/guangyey/242/base 2025-12-04T12:46:35.4942670Z * [new branch] gh/guangyey/242/head -> origin/gh/guangyey/242/head 2025-12-04T12:46:35.4942741Z * [new branch] gh/guangyey/242/orig -> origin/gh/guangyey/242/orig 2025-12-04T12:46:35.4942811Z * [new branch] gh/guangyey/243/base -> origin/gh/guangyey/243/base 2025-12-04T12:46:35.4942883Z * [new branch] gh/guangyey/243/head -> origin/gh/guangyey/243/head 2025-12-04T12:46:35.4942954Z * [new branch] gh/guangyey/243/orig -> origin/gh/guangyey/243/orig 2025-12-04T12:46:35.4943024Z * [new branch] gh/guangyey/244/base -> origin/gh/guangyey/244/base 2025-12-04T12:46:35.4943093Z * [new branch] gh/guangyey/244/head -> origin/gh/guangyey/244/head 2025-12-04T12:46:35.4943165Z * [new branch] gh/guangyey/244/orig -> origin/gh/guangyey/244/orig 2025-12-04T12:46:35.4943235Z * [new branch] gh/guangyey/245/base -> origin/gh/guangyey/245/base 
2025-12-04T12:46:35.4943308Z * [new branch] gh/guangyey/245/head -> origin/gh/guangyey/245/head 2025-12-04T12:46:35.4943380Z * [new branch] gh/guangyey/245/orig -> origin/gh/guangyey/245/orig 2025-12-04T12:46:35.4943452Z * [new branch] gh/guangyey/246/base -> origin/gh/guangyey/246/base 2025-12-04T12:46:35.4943525Z * [new branch] gh/guangyey/246/head -> origin/gh/guangyey/246/head 2025-12-04T12:46:35.4943594Z * [new branch] gh/guangyey/246/orig -> origin/gh/guangyey/246/orig 2025-12-04T12:46:35.4943665Z * [new branch] gh/guangyey/247/base -> origin/gh/guangyey/247/base 2025-12-04T12:46:35.4943740Z * [new branch] gh/guangyey/247/head -> origin/gh/guangyey/247/head 2025-12-04T12:46:35.4943812Z * [new branch] gh/guangyey/247/orig -> origin/gh/guangyey/247/orig 2025-12-04T12:46:35.4943885Z * [new branch] gh/guangyey/248/base -> origin/gh/guangyey/248/base 2025-12-04T12:46:35.4943959Z * [new branch] gh/guangyey/248/head -> origin/gh/guangyey/248/head 2025-12-04T12:46:35.4944029Z * [new branch] gh/guangyey/248/orig -> origin/gh/guangyey/248/orig 2025-12-04T12:46:35.4944099Z * [new branch] gh/guangyey/249/base -> origin/gh/guangyey/249/base 2025-12-04T12:46:35.4944172Z * [new branch] gh/guangyey/249/head -> origin/gh/guangyey/249/head 2025-12-04T12:46:35.4944243Z * [new branch] gh/guangyey/249/orig -> origin/gh/guangyey/249/orig 2025-12-04T12:46:35.4944313Z * [new branch] gh/guangyey/250/base -> origin/gh/guangyey/250/base 2025-12-04T12:46:35.4944388Z * [new branch] gh/guangyey/250/head -> origin/gh/guangyey/250/head 2025-12-04T12:46:35.4944459Z * [new branch] gh/guangyey/250/orig -> origin/gh/guangyey/250/orig 2025-12-04T12:46:35.4944555Z * [new branch] gh/guangyey/251/base -> origin/gh/guangyey/251/base 2025-12-04T12:46:35.4944628Z * [new branch] gh/guangyey/251/head -> origin/gh/guangyey/251/head 2025-12-04T12:46:35.4944698Z * [new branch] gh/guangyey/251/orig -> origin/gh/guangyey/251/orig 2025-12-04T12:46:35.4944767Z * [new branch] gh/guangyey/252/base -> 
origin/gh/guangyey/252/base 2025-12-04T12:46:35.4944863Z * [new branch] gh/guangyey/252/head -> origin/gh/guangyey/252/head 2025-12-04T12:46:35.4944934Z * [new branch] gh/guangyey/252/orig -> origin/gh/guangyey/252/orig 2025-12-04T12:46:35.4945006Z * [new branch] gh/guangyey/253/base -> origin/gh/guangyey/253/base 2025-12-04T12:46:35.4945076Z * [new branch] gh/guangyey/253/head -> origin/gh/guangyey/253/head 2025-12-04T12:46:35.4945146Z * [new branch] gh/guangyey/253/orig -> origin/gh/guangyey/253/orig 2025-12-04T12:46:35.4945219Z * [new branch] gh/guangyey/254/base -> origin/gh/guangyey/254/base 2025-12-04T12:46:35.4945289Z * [new branch] gh/guangyey/254/head -> origin/gh/guangyey/254/head 2025-12-04T12:46:35.4945359Z * [new branch] gh/guangyey/254/orig -> origin/gh/guangyey/254/orig 2025-12-04T12:46:35.4945431Z * [new branch] gh/guangyey/255/base -> origin/gh/guangyey/255/base 2025-12-04T12:46:35.4945503Z * [new branch] gh/guangyey/255/head -> origin/gh/guangyey/255/head 2025-12-04T12:46:35.4945573Z * [new branch] gh/guangyey/255/orig -> origin/gh/guangyey/255/orig 2025-12-04T12:46:35.4945645Z * [new branch] gh/guangyey/256/base -> origin/gh/guangyey/256/base 2025-12-04T12:46:35.4945715Z * [new branch] gh/guangyey/256/head -> origin/gh/guangyey/256/head 2025-12-04T12:46:35.4945785Z * [new branch] gh/guangyey/256/orig -> origin/gh/guangyey/256/orig 2025-12-04T12:46:35.4945885Z * [new branch] gh/guilhermeleobas/107/base -> origin/gh/guilhermeleobas/107/base 2025-12-04T12:46:35.4945977Z * [new branch] gh/guilhermeleobas/107/head -> origin/gh/guilhermeleobas/107/head 2025-12-04T12:46:35.4946066Z * [new branch] gh/guilhermeleobas/107/orig -> origin/gh/guilhermeleobas/107/orig 2025-12-04T12:46:35.4946157Z * [new branch] gh/guilhermeleobas/108/base -> origin/gh/guilhermeleobas/108/base 2025-12-04T12:46:35.4946246Z * [new branch] gh/guilhermeleobas/108/head -> origin/gh/guilhermeleobas/108/head 2025-12-04T12:46:35.4946333Z * [new branch] gh/guilhermeleobas/108/orig -> 
origin/gh/guilhermeleobas/108/orig 2025-12-04T12:46:35.4946423Z * [new branch] gh/guilhermeleobas/150/base -> origin/gh/guilhermeleobas/150/base 2025-12-04T12:46:35.4946510Z * [new branch] gh/guilhermeleobas/150/head -> origin/gh/guilhermeleobas/150/head 2025-12-04T12:46:35.4946601Z * [new branch] gh/guilhermeleobas/150/orig -> origin/gh/guilhermeleobas/150/orig 2025-12-04T12:46:35.4946688Z * [new branch] gh/guilhermeleobas/168/base -> origin/gh/guilhermeleobas/168/base 2025-12-04T12:46:35.4946775Z * [new branch] gh/guilhermeleobas/168/head -> origin/gh/guilhermeleobas/168/head 2025-12-04T12:46:35.4946866Z * [new branch] gh/guilhermeleobas/168/orig -> origin/gh/guilhermeleobas/168/orig 2025-12-04T12:46:35.4946956Z * [new branch] gh/guilhermeleobas/169/base -> origin/gh/guilhermeleobas/169/base 2025-12-04T12:46:35.4947045Z * [new branch] gh/guilhermeleobas/169/head -> origin/gh/guilhermeleobas/169/head 2025-12-04T12:46:35.4947135Z * [new branch] gh/guilhermeleobas/169/orig -> origin/gh/guilhermeleobas/169/orig 2025-12-04T12:46:35.4947225Z * [new branch] gh/guilhermeleobas/170/base -> origin/gh/guilhermeleobas/170/base 2025-12-04T12:46:35.4947340Z * [new branch] gh/guilhermeleobas/170/head -> origin/gh/guilhermeleobas/170/head 2025-12-04T12:46:35.4947430Z * [new branch] gh/guilhermeleobas/170/orig -> origin/gh/guilhermeleobas/170/orig 2025-12-04T12:46:35.4947559Z * [new branch] gh/guilhermeleobas/171/base -> origin/gh/guilhermeleobas/171/base 2025-12-04T12:46:35.4947648Z * [new branch] gh/guilhermeleobas/171/head -> origin/gh/guilhermeleobas/171/head 2025-12-04T12:46:35.4947774Z * [new branch] gh/guilhermeleobas/171/orig -> origin/gh/guilhermeleobas/171/orig 2025-12-04T12:46:35.4947861Z * [new branch] gh/guilhermeleobas/173/base -> origin/gh/guilhermeleobas/173/base 2025-12-04T12:46:35.4947950Z * [new branch] gh/guilhermeleobas/173/head -> origin/gh/guilhermeleobas/173/head 2025-12-04T12:46:35.4948039Z * [new branch] gh/guilhermeleobas/173/orig -> 
origin/gh/guilhermeleobas/173/orig 2025-12-04T12:46:35.4948128Z * [new branch] gh/guilhermeleobas/193/base -> origin/gh/guilhermeleobas/193/base 2025-12-04T12:46:35.4948216Z * [new branch] gh/guilhermeleobas/193/head -> origin/gh/guilhermeleobas/193/head 2025-12-04T12:46:35.4948303Z * [new branch] gh/guilhermeleobas/193/orig -> origin/gh/guilhermeleobas/193/orig 2025-12-04T12:46:35.4948390Z * [new branch] gh/guilhermeleobas/204/base -> origin/gh/guilhermeleobas/204/base 2025-12-04T12:46:35.4948480Z * [new branch] gh/guilhermeleobas/204/head -> origin/gh/guilhermeleobas/204/head 2025-12-04T12:46:35.4948566Z * [new branch] gh/guilhermeleobas/204/orig -> origin/gh/guilhermeleobas/204/orig 2025-12-04T12:46:35.4948654Z * [new branch] gh/guilhermeleobas/211/base -> origin/gh/guilhermeleobas/211/base 2025-12-04T12:46:35.4948743Z * [new branch] gh/guilhermeleobas/211/head -> origin/gh/guilhermeleobas/211/head 2025-12-04T12:46:35.4948830Z * [new branch] gh/guilhermeleobas/211/orig -> origin/gh/guilhermeleobas/211/orig 2025-12-04T12:46:35.4948919Z * [new branch] gh/guilhermeleobas/226/base -> origin/gh/guilhermeleobas/226/base 2025-12-04T12:46:35.4949007Z * [new branch] gh/guilhermeleobas/226/head -> origin/gh/guilhermeleobas/226/head 2025-12-04T12:46:35.4949094Z * [new branch] gh/guilhermeleobas/226/orig -> origin/gh/guilhermeleobas/226/orig 2025-12-04T12:46:35.4949182Z * [new branch] gh/guilhermeleobas/236/base -> origin/gh/guilhermeleobas/236/base 2025-12-04T12:46:35.4949271Z * [new branch] gh/guilhermeleobas/236/head -> origin/gh/guilhermeleobas/236/head 2025-12-04T12:46:35.4949358Z * [new branch] gh/guilhermeleobas/236/orig -> origin/gh/guilhermeleobas/236/orig 2025-12-04T12:46:35.4949447Z * [new branch] gh/guilhermeleobas/247/base -> origin/gh/guilhermeleobas/247/base 2025-12-04T12:46:35.4949534Z * [new branch] gh/guilhermeleobas/247/head -> origin/gh/guilhermeleobas/247/head 2025-12-04T12:46:35.4949622Z * [new branch] gh/guilhermeleobas/247/orig -> 
origin/gh/guilhermeleobas/247/orig 2025-12-04T12:46:35.4949711Z * [new branch] gh/guilhermeleobas/248/base -> origin/gh/guilhermeleobas/248/base 2025-12-04T12:46:35.4949798Z * [new branch] gh/guilhermeleobas/248/head -> origin/gh/guilhermeleobas/248/head 2025-12-04T12:46:35.4949886Z * [new branch] gh/guilhermeleobas/248/orig -> origin/gh/guilhermeleobas/248/orig 2025-12-04T12:46:35.4949974Z * [new branch] gh/guilhermeleobas/250/base -> origin/gh/guilhermeleobas/250/base 2025-12-04T12:46:35.4950061Z * [new branch] gh/guilhermeleobas/250/head -> origin/gh/guilhermeleobas/250/head 2025-12-04T12:46:35.4950148Z * [new branch] gh/guilhermeleobas/250/orig -> origin/gh/guilhermeleobas/250/orig 2025-12-04T12:46:35.4950236Z * [new branch] gh/guilhermeleobas/253/base -> origin/gh/guilhermeleobas/253/base 2025-12-04T12:46:35.4950372Z * [new branch] gh/guilhermeleobas/253/head -> origin/gh/guilhermeleobas/253/head 2025-12-04T12:46:35.4950461Z * [new branch] gh/guilhermeleobas/253/orig -> origin/gh/guilhermeleobas/253/orig 2025-12-04T12:46:35.4950547Z * [new branch] gh/guilhermeleobas/254/base -> origin/gh/guilhermeleobas/254/base 2025-12-04T12:46:35.4950659Z * [new branch] gh/guilhermeleobas/254/head -> origin/gh/guilhermeleobas/254/head 2025-12-04T12:46:35.4950747Z * [new branch] gh/guilhermeleobas/254/orig -> origin/gh/guilhermeleobas/254/orig 2025-12-04T12:46:35.4950833Z * [new branch] gh/guilhermeleobas/255/base -> origin/gh/guilhermeleobas/255/base 2025-12-04T12:46:35.4950919Z * [new branch] gh/guilhermeleobas/255/head -> origin/gh/guilhermeleobas/255/head 2025-12-04T12:46:35.4951007Z * [new branch] gh/guilhermeleobas/255/orig -> origin/gh/guilhermeleobas/255/orig 2025-12-04T12:46:35.4951096Z * [new branch] gh/guilhermeleobas/256/base -> origin/gh/guilhermeleobas/256/base 2025-12-04T12:46:35.4951182Z * [new branch] gh/guilhermeleobas/256/head -> origin/gh/guilhermeleobas/256/head 2025-12-04T12:46:35.4951270Z * [new branch] gh/guilhermeleobas/256/orig -> 
origin/gh/guilhermeleobas/256/orig 2025-12-04T12:46:35.4951359Z * [new branch] gh/guilhermeleobas/257/base -> origin/gh/guilhermeleobas/257/base 2025-12-04T12:46:35.4951445Z * [new branch] gh/guilhermeleobas/257/head -> origin/gh/guilhermeleobas/257/head 2025-12-04T12:46:35.4951532Z * [new branch] gh/guilhermeleobas/257/orig -> origin/gh/guilhermeleobas/257/orig 2025-12-04T12:46:35.4951619Z * [new branch] gh/guilhermeleobas/258/base -> origin/gh/guilhermeleobas/258/base 2025-12-04T12:46:35.4951707Z * [new branch] gh/guilhermeleobas/258/head -> origin/gh/guilhermeleobas/258/head 2025-12-04T12:46:35.4951795Z * [new branch] gh/guilhermeleobas/258/orig -> origin/gh/guilhermeleobas/258/orig 2025-12-04T12:46:35.4951882Z * [new branch] gh/guilhermeleobas/259/base -> origin/gh/guilhermeleobas/259/base 2025-12-04T12:46:35.4951970Z * [new branch] gh/guilhermeleobas/259/head -> origin/gh/guilhermeleobas/259/head 2025-12-04T12:46:35.4952059Z * [new branch] gh/guilhermeleobas/259/orig -> origin/gh/guilhermeleobas/259/orig 2025-12-04T12:46:35.4952146Z * [new branch] gh/guilhermeleobas/260/base -> origin/gh/guilhermeleobas/260/base 2025-12-04T12:46:35.4952233Z * [new branch] gh/guilhermeleobas/260/head -> origin/gh/guilhermeleobas/260/head 2025-12-04T12:46:35.4952320Z * [new branch] gh/guilhermeleobas/260/orig -> origin/gh/guilhermeleobas/260/orig 2025-12-04T12:46:35.4952407Z * [new branch] gh/guilhermeleobas/261/base -> origin/gh/guilhermeleobas/261/base 2025-12-04T12:46:35.4952496Z * [new branch] gh/guilhermeleobas/261/head -> origin/gh/guilhermeleobas/261/head 2025-12-04T12:46:35.4952582Z * [new branch] gh/guilhermeleobas/261/orig -> origin/gh/guilhermeleobas/261/orig 2025-12-04T12:46:35.4952669Z * [new branch] gh/guilhermeleobas/262/base -> origin/gh/guilhermeleobas/262/base 2025-12-04T12:46:35.4952756Z * [new branch] gh/guilhermeleobas/262/head -> origin/gh/guilhermeleobas/262/head 2025-12-04T12:46:35.4952845Z * [new branch] gh/guilhermeleobas/262/orig -> 
origin/gh/guilhermeleobas/262/orig 2025-12-04T12:46:35.4952932Z * [new branch] gh/guilhermeleobas/263/base -> origin/gh/guilhermeleobas/263/base 2025-12-04T12:46:35.4953020Z * [new branch] gh/guilhermeleobas/263/head -> origin/gh/guilhermeleobas/263/head 2025-12-04T12:46:35.4953107Z * [new branch] gh/guilhermeleobas/263/orig -> origin/gh/guilhermeleobas/263/orig 2025-12-04T12:46:35.4953220Z * [new branch] gh/guilhermeleobas/264/base -> origin/gh/guilhermeleobas/264/base 2025-12-04T12:46:35.4953307Z * [new branch] gh/guilhermeleobas/264/head -> origin/gh/guilhermeleobas/264/head 2025-12-04T12:46:35.4953394Z * [new branch] gh/guilhermeleobas/264/orig -> origin/gh/guilhermeleobas/264/orig 2025-12-04T12:46:35.4953482Z * [new branch] gh/guilhermeleobas/265/base -> origin/gh/guilhermeleobas/265/base 2025-12-04T12:46:35.4953598Z * [new branch] gh/guilhermeleobas/265/head -> origin/gh/guilhermeleobas/265/head 2025-12-04T12:46:35.4953685Z * [new branch] gh/guilhermeleobas/265/orig -> origin/gh/guilhermeleobas/265/orig 2025-12-04T12:46:35.4953772Z * [new branch] gh/guilhermeleobas/266/base -> origin/gh/guilhermeleobas/266/base 2025-12-04T12:46:35.4953859Z * [new branch] gh/guilhermeleobas/266/head -> origin/gh/guilhermeleobas/266/head 2025-12-04T12:46:35.4953947Z * [new branch] gh/guilhermeleobas/266/orig -> origin/gh/guilhermeleobas/266/orig 2025-12-04T12:46:35.4954035Z * [new branch] gh/guilhermeleobas/267/base -> origin/gh/guilhermeleobas/267/base 2025-12-04T12:46:35.4954122Z * [new branch] gh/guilhermeleobas/267/head -> origin/gh/guilhermeleobas/267/head 2025-12-04T12:46:35.4954208Z * [new branch] gh/guilhermeleobas/267/orig -> origin/gh/guilhermeleobas/267/orig 2025-12-04T12:46:35.4954292Z * [new branch] gh/hameerabbasi/1/base -> origin/gh/hameerabbasi/1/base 2025-12-04T12:46:35.4954369Z * [new branch] gh/hameerabbasi/1/head -> origin/gh/hameerabbasi/1/head 2025-12-04T12:46:35.4954443Z * [new branch] gh/hameerabbasi/2/base -> origin/gh/hameerabbasi/2/base 
2025-12-04T12:46:35.4954519Z * [new branch] gh/hameerabbasi/2/head -> origin/gh/hameerabbasi/2/head 2025-12-04T12:46:35.4954594Z * [new branch] gh/hameerabbasi/2/orig -> origin/gh/hameerabbasi/2/orig 2025-12-04T12:46:35.4954670Z * [new branch] gh/hameerabbasi/3/base -> origin/gh/hameerabbasi/3/base 2025-12-04T12:46:35.4954744Z * [new branch] gh/hameerabbasi/3/head -> origin/gh/hameerabbasi/3/head 2025-12-04T12:46:35.4954817Z * [new branch] gh/hameerabbasi/3/orig -> origin/gh/hameerabbasi/3/orig 2025-12-04T12:46:35.4954894Z * [new branch] gh/hameerabbasi/4/base -> origin/gh/hameerabbasi/4/base 2025-12-04T12:46:35.4954968Z * [new branch] gh/hameerabbasi/4/head -> origin/gh/hameerabbasi/4/head 2025-12-04T12:46:35.4955042Z * [new branch] gh/hameerabbasi/4/orig -> origin/gh/hameerabbasi/4/orig 2025-12-04T12:46:35.4955112Z * [new branch] gh/huydhn/1/next -> origin/gh/huydhn/1/next 2025-12-04T12:46:35.4955180Z * [new branch] gh/huydhn/2/next -> origin/gh/huydhn/2/next 2025-12-04T12:46:35.4955246Z * [new branch] gh/huydhn/3/next -> origin/gh/huydhn/3/next 2025-12-04T12:46:35.4955314Z * [new branch] gh/huydhn/4/next -> origin/gh/huydhn/4/next 2025-12-04T12:46:35.4955379Z * [new branch] gh/huydhn/5/next -> origin/gh/huydhn/5/next 2025-12-04T12:46:35.4955444Z * [new branch] gh/huydhn/6/next -> origin/gh/huydhn/6/next 2025-12-04T12:46:35.4955511Z * [new branch] gh/int3/97/base -> origin/gh/int3/97/base 2025-12-04T12:46:35.4955579Z * [new branch] gh/int3/97/head -> origin/gh/int3/97/head 2025-12-04T12:46:35.4955649Z * [new branch] gh/isuruf/101/base -> origin/gh/isuruf/101/base 2025-12-04T12:46:35.4955718Z * [new branch] gh/isuruf/101/head -> origin/gh/isuruf/101/head 2025-12-04T12:46:35.4955786Z * [new branch] gh/isuruf/146/base -> origin/gh/isuruf/146/base 2025-12-04T12:46:35.4955852Z * [new branch] gh/isuruf/146/head -> origin/gh/isuruf/146/head 2025-12-04T12:46:35.4955945Z * [new branch] gh/isuruf/146/orig -> origin/gh/isuruf/146/orig 2025-12-04T12:46:35.4956012Z * [new 
branch] gh/isuruf/158/base -> origin/gh/isuruf/158/base 2025-12-04T12:46:35.4956078Z * [new branch] gh/isuruf/158/head -> origin/gh/isuruf/158/head 2025-12-04T12:46:35.4956146Z * [new branch] gh/isuruf/159/base -> origin/gh/isuruf/159/base 2025-12-04T12:46:35.4956239Z * [new branch] gh/isuruf/159/head -> origin/gh/isuruf/159/head 2025-12-04T12:46:35.4956306Z * [new branch] gh/isuruf/160/base -> origin/gh/isuruf/160/base 2025-12-04T12:46:35.4956372Z * [new branch] gh/isuruf/160/head -> origin/gh/isuruf/160/head 2025-12-04T12:46:35.4956440Z * [new branch] gh/isuruf/160/orig -> origin/gh/isuruf/160/orig 2025-12-04T12:46:35.4956509Z * [new branch] gh/isuruf/81/base -> origin/gh/isuruf/81/base 2025-12-04T12:46:35.4956578Z * [new branch] gh/isuruf/81/head -> origin/gh/isuruf/81/head 2025-12-04T12:46:35.4956645Z * [new branch] gh/isuruf/81/orig -> origin/gh/isuruf/81/orig 2025-12-04T12:46:35.4956718Z * [new branch] gh/jamesjwu/176/base -> origin/gh/jamesjwu/176/base 2025-12-04T12:46:35.4956790Z * [new branch] gh/jamesjwu/176/head -> origin/gh/jamesjwu/176/head 2025-12-04T12:46:35.4956862Z * [new branch] gh/jamesjwu/176/orig -> origin/gh/jamesjwu/176/orig 2025-12-04T12:46:35.4956933Z * [new branch] gh/jamesjwu/187/base -> origin/gh/jamesjwu/187/base 2025-12-04T12:46:35.4957003Z * [new branch] gh/jamesjwu/187/head -> origin/gh/jamesjwu/187/head 2025-12-04T12:46:35.4957072Z * [new branch] gh/jamesjwu/187/orig -> origin/gh/jamesjwu/187/orig 2025-12-04T12:46:35.4957144Z * [new branch] gh/jamesjwu/196/base -> origin/gh/jamesjwu/196/base 2025-12-04T12:46:35.4957214Z * [new branch] gh/jamesjwu/196/head -> origin/gh/jamesjwu/196/head 2025-12-04T12:46:35.4957284Z * [new branch] gh/jamesjwu/196/orig -> origin/gh/jamesjwu/196/orig 2025-12-04T12:46:35.4957355Z * [new branch] gh/jamesjwu/198/base -> origin/gh/jamesjwu/198/base 2025-12-04T12:46:35.4957424Z * [new branch] gh/jamesjwu/198/head -> origin/gh/jamesjwu/198/head 2025-12-04T12:46:35.4957532Z * [new branch] gh/jamesjwu/198/orig 
-> origin/gh/jamesjwu/198/orig 2025-12-04T12:46:35.4957605Z * [new branch] gh/jamesjwu/207/base -> origin/gh/jamesjwu/207/base 2025-12-04T12:46:35.4957674Z * [new branch] gh/jamesjwu/207/head -> origin/gh/jamesjwu/207/head 2025-12-04T12:46:35.4957744Z * [new branch] gh/jamesjwu/207/orig -> origin/gh/jamesjwu/207/orig 2025-12-04T12:46:35.4957816Z * [new branch] gh/jamesjwu/208/base -> origin/gh/jamesjwu/208/base 2025-12-04T12:46:35.4957886Z * [new branch] gh/jamesjwu/208/head -> origin/gh/jamesjwu/208/head 2025-12-04T12:46:35.4957958Z * [new branch] gh/jamesjwu/208/orig -> origin/gh/jamesjwu/208/orig 2025-12-04T12:46:35.4958030Z * [new branch] gh/jamesjwu/52/base -> origin/gh/jamesjwu/52/base 2025-12-04T12:46:35.4958100Z * [new branch] gh/jamesjwu/52/head -> origin/gh/jamesjwu/52/head 2025-12-04T12:46:35.4958173Z * [new branch] gh/jamesjwu/53/base -> origin/gh/jamesjwu/53/base 2025-12-04T12:46:35.4958242Z * [new branch] gh/jamesjwu/53/head -> origin/gh/jamesjwu/53/head 2025-12-04T12:46:35.4958311Z * [new branch] gh/jamesjwu/54/base -> origin/gh/jamesjwu/54/base 2025-12-04T12:46:35.4958379Z * [new branch] gh/jamesjwu/54/head -> origin/gh/jamesjwu/54/head 2025-12-04T12:46:35.4958447Z * [new branch] gh/jamesjwu/55/base -> origin/gh/jamesjwu/55/base 2025-12-04T12:46:35.4958550Z * [new branch] gh/jamesjwu/55/head -> origin/gh/jamesjwu/55/head 2025-12-04T12:46:35.4958620Z * [new branch] gh/jamesjwu/56/base -> origin/gh/jamesjwu/56/base 2025-12-04T12:46:35.4958689Z * [new branch] gh/jamesjwu/56/head -> origin/gh/jamesjwu/56/head 2025-12-04T12:46:35.4958757Z * [new branch] gh/jamesjwu/57/base -> origin/gh/jamesjwu/57/base 2025-12-04T12:46:35.4958874Z * [new branch] gh/jamesjwu/57/head -> origin/gh/jamesjwu/57/head 2025-12-04T12:46:35.4958943Z * [new branch] gh/jamesjwu/58/base -> origin/gh/jamesjwu/58/base 2025-12-04T12:46:35.4959011Z * [new branch] gh/jamesjwu/58/head -> origin/gh/jamesjwu/58/head 2025-12-04T12:46:35.4959081Z * [new branch] gh/jamesjwu/59/base -> 
origin/gh/jamesjwu/59/base 2025-12-04T12:46:35.4959149Z * [new branch] gh/jamesjwu/59/head -> origin/gh/jamesjwu/59/head 2025-12-04T12:46:35.4959219Z * [new branch] gh/jamesjwu/60/base -> origin/gh/jamesjwu/60/base 2025-12-04T12:46:35.4959289Z * [new branch] gh/jamesjwu/60/head -> origin/gh/jamesjwu/60/head 2025-12-04T12:46:35.4959357Z * [new branch] gh/jamesjwu/61/base -> origin/gh/jamesjwu/61/base 2025-12-04T12:46:35.4959425Z * [new branch] gh/jamesjwu/61/head -> origin/gh/jamesjwu/61/head 2025-12-04T12:46:35.4959497Z * [new branch] gh/jamesjwu/62/base -> origin/gh/jamesjwu/62/base 2025-12-04T12:46:35.4959565Z * [new branch] gh/jamesjwu/62/head -> origin/gh/jamesjwu/62/head 2025-12-04T12:46:35.4959634Z * [new branch] gh/jamesjwu/63/base -> origin/gh/jamesjwu/63/base 2025-12-04T12:46:35.4959702Z * [new branch] gh/jamesjwu/63/head -> origin/gh/jamesjwu/63/head 2025-12-04T12:46:35.4959771Z * [new branch] gh/jamesjwu/64/base -> origin/gh/jamesjwu/64/base 2025-12-04T12:46:35.4959842Z * [new branch] gh/jamesjwu/64/head -> origin/gh/jamesjwu/64/head 2025-12-04T12:46:35.4959910Z * [new branch] gh/jamesjwu/65/base -> origin/gh/jamesjwu/65/base 2025-12-04T12:46:35.4959978Z * [new branch] gh/jamesjwu/65/head -> origin/gh/jamesjwu/65/head 2025-12-04T12:46:35.4960051Z * [new branch] gh/janeyx99/165/base -> origin/gh/janeyx99/165/base 2025-12-04T12:46:35.4960120Z * [new branch] gh/janeyx99/165/head -> origin/gh/janeyx99/165/head 2025-12-04T12:46:35.4960190Z * [new branch] gh/janeyx99/165/orig -> origin/gh/janeyx99/165/orig 2025-12-04T12:46:35.4960260Z * [new branch] gh/janeyx99/201/base -> origin/gh/janeyx99/201/base 2025-12-04T12:46:35.4960328Z * [new branch] gh/janeyx99/201/head -> origin/gh/janeyx99/201/head 2025-12-04T12:46:35.4960397Z * [new branch] gh/janeyx99/201/orig -> origin/gh/janeyx99/201/orig 2025-12-04T12:46:35.4960468Z * [new branch] gh/janeyx99/225/base -> origin/gh/janeyx99/225/base 2025-12-04T12:46:35.4960537Z * [new branch] gh/janeyx99/225/head -> 
origin/gh/janeyx99/225/head 2025-12-04T12:46:35.4960606Z * [new branch] gh/janeyx99/225/orig -> origin/gh/janeyx99/225/orig 2025-12-04T12:46:35.4960678Z * [new branch] gh/janeyx99/299/base -> origin/gh/janeyx99/299/base 2025-12-04T12:46:35.4960747Z * [new branch] gh/janeyx99/299/head -> origin/gh/janeyx99/299/head 2025-12-04T12:46:35.4960816Z * [new branch] gh/janeyx99/299/orig -> origin/gh/janeyx99/299/orig 2025-12-04T12:46:35.4960885Z * [new branch] gh/janeyx99/302/base -> origin/gh/janeyx99/302/base 2025-12-04T12:46:35.4960954Z * [new branch] gh/janeyx99/302/head -> origin/gh/janeyx99/302/head 2025-12-04T12:46:35.4961023Z * [new branch] gh/janeyx99/303/base -> origin/gh/janeyx99/303/base 2025-12-04T12:46:35.4961117Z * [new branch] gh/janeyx99/303/head -> origin/gh/janeyx99/303/head 2025-12-04T12:46:35.4961186Z * [new branch] gh/janeyx99/305/base -> origin/gh/janeyx99/305/base 2025-12-04T12:46:35.4961256Z * [new branch] gh/janeyx99/305/head -> origin/gh/janeyx99/305/head 2025-12-04T12:46:35.4961346Z * [new branch] gh/janeyx99/306/base -> origin/gh/janeyx99/306/base 2025-12-04T12:46:35.4961415Z * [new branch] gh/janeyx99/306/head -> origin/gh/janeyx99/306/head 2025-12-04T12:46:35.4961484Z * [new branch] gh/janeyx99/314/base -> origin/gh/janeyx99/314/base 2025-12-04T12:46:35.4961552Z * [new branch] gh/janeyx99/314/head -> origin/gh/janeyx99/314/head 2025-12-04T12:46:35.4961620Z * [new branch] gh/janeyx99/314/orig -> origin/gh/janeyx99/314/orig 2025-12-04T12:46:35.4961690Z * [new branch] gh/janeyx99/315/base -> origin/gh/janeyx99/315/base 2025-12-04T12:46:35.4961760Z * [new branch] gh/janeyx99/315/head -> origin/gh/janeyx99/315/head 2025-12-04T12:46:35.4961829Z * [new branch] gh/janeyx99/315/orig -> origin/gh/janeyx99/315/orig 2025-12-04T12:46:35.4961898Z * [new branch] gh/janeyx99/316/base -> origin/gh/janeyx99/316/base 2025-12-04T12:46:35.4961968Z * [new branch] gh/janeyx99/316/head -> origin/gh/janeyx99/316/head 2025-12-04T12:46:35.4962037Z * [new branch] 
gh/janeyx99/316/orig -> origin/gh/janeyx99/316/orig 2025-12-04T12:46:35.4962106Z * [new branch] gh/janeyx99/317/base -> origin/gh/janeyx99/317/base 2025-12-04T12:46:35.4962175Z * [new branch] gh/janeyx99/317/head -> origin/gh/janeyx99/317/head 2025-12-04T12:46:35.4962244Z * [new branch] gh/janeyx99/317/orig -> origin/gh/janeyx99/317/orig 2025-12-04T12:46:35.4962315Z * [new branch] gh/janeyx99/325/base -> origin/gh/janeyx99/325/base 2025-12-04T12:46:35.4962386Z * [new branch] gh/janeyx99/325/head -> origin/gh/janeyx99/325/head 2025-12-04T12:46:35.4962454Z * [new branch] gh/janeyx99/325/orig -> origin/gh/janeyx99/325/orig 2025-12-04T12:46:35.4962524Z * [new branch] gh/janeyx99/327/base -> origin/gh/janeyx99/327/base 2025-12-04T12:46:35.4962598Z * [new branch] gh/janeyx99/327/head -> origin/gh/janeyx99/327/head 2025-12-04T12:46:35.4962667Z * [new branch] gh/janeyx99/327/orig -> origin/gh/janeyx99/327/orig 2025-12-04T12:46:35.4962736Z * [new branch] gh/janeyx99/328/base -> origin/gh/janeyx99/328/base 2025-12-04T12:46:35.4962805Z * [new branch] gh/janeyx99/328/head -> origin/gh/janeyx99/328/head 2025-12-04T12:46:35.4962874Z * [new branch] gh/janeyx99/328/orig -> origin/gh/janeyx99/328/orig 2025-12-04T12:46:35.4962943Z * [new branch] gh/janeyx99/329/base -> origin/gh/janeyx99/329/base 2025-12-04T12:46:35.4963012Z * [new branch] gh/janeyx99/329/head -> origin/gh/janeyx99/329/head 2025-12-04T12:46:35.4963082Z * [new branch] gh/janeyx99/329/orig -> origin/gh/janeyx99/329/orig 2025-12-04T12:46:35.4963150Z * [new branch] gh/janeyx99/330/base -> origin/gh/janeyx99/330/base 2025-12-04T12:46:35.4963219Z * [new branch] gh/janeyx99/330/head -> origin/gh/janeyx99/330/head 2025-12-04T12:46:35.4963289Z * [new branch] gh/janeyx99/330/orig -> origin/gh/janeyx99/330/orig 2025-12-04T12:46:35.4963357Z * [new branch] gh/janeyx99/331/base -> origin/gh/janeyx99/331/base 2025-12-04T12:46:35.4963425Z * [new branch] gh/janeyx99/331/head -> origin/gh/janeyx99/331/head 
2025-12-04T12:46:35.4963495Z * [new branch] gh/janeyx99/331/orig -> origin/gh/janeyx99/331/orig 2025-12-04T12:46:35.4963588Z * [new branch] gh/janeyx99/332/base -> origin/gh/janeyx99/332/base 2025-12-04T12:46:35.4963657Z * [new branch] gh/janeyx99/332/head -> origin/gh/janeyx99/332/head 2025-12-04T12:46:35.4963727Z * [new branch] gh/janeyx99/332/orig -> origin/gh/janeyx99/332/orig 2025-12-04T12:46:35.4963795Z * [new branch] gh/janeyx99/333/base -> origin/gh/janeyx99/333/base 2025-12-04T12:46:35.4963892Z * [new branch] gh/janeyx99/333/head -> origin/gh/janeyx99/333/head 2025-12-04T12:46:35.4963962Z * [new branch] gh/janeyx99/333/orig -> origin/gh/janeyx99/333/orig 2025-12-04T12:46:35.4964030Z * [new branch] gh/janeyx99/88/base -> origin/gh/janeyx99/88/base 2025-12-04T12:46:35.4964099Z * [new branch] gh/janeyx99/88/head -> origin/gh/janeyx99/88/head 2025-12-04T12:46:35.4964167Z * [new branch] gh/janeyx99/88/orig -> origin/gh/janeyx99/88/orig 2025-12-04T12:46:35.4964237Z * [new branch] gh/jansel/360/base -> origin/gh/jansel/360/base 2025-12-04T12:46:35.4964306Z * [new branch] gh/jansel/360/head -> origin/gh/jansel/360/head 2025-12-04T12:46:35.4964374Z * [new branch] gh/jansel/451/base -> origin/gh/jansel/451/base 2025-12-04T12:46:35.4964441Z * [new branch] gh/jansel/451/head -> origin/gh/jansel/451/head 2025-12-04T12:46:35.4964512Z * [new branch] gh/jansel/451/orig -> origin/gh/jansel/451/orig 2025-12-04T12:46:35.4964578Z * [new branch] gh/jansel/462/base -> origin/gh/jansel/462/base 2025-12-04T12:46:35.4964644Z * [new branch] gh/jansel/462/head -> origin/gh/jansel/462/head 2025-12-04T12:46:35.4964712Z * [new branch] gh/jansel/462/orig -> origin/gh/jansel/462/orig 2025-12-04T12:46:35.4964779Z * [new branch] gh/jansel/533/base -> origin/gh/jansel/533/base 2025-12-04T12:46:35.4964846Z * [new branch] gh/jansel/533/head -> origin/gh/jansel/533/head 2025-12-04T12:46:35.4964915Z * [new branch] gh/jansel/533/orig -> origin/gh/jansel/533/orig 2025-12-04T12:46:35.4964983Z * 
[new branch] gh/jansel/552/base -> origin/gh/jansel/552/base 2025-12-04T12:46:35.4965051Z * [new branch] gh/jansel/552/head -> origin/gh/jansel/552/head 2025-12-04T12:46:35.4965121Z * [new branch] gh/jansel/552/orig -> origin/gh/jansel/552/orig 2025-12-04T12:46:35.4965188Z * [new branch] gh/jansel/553/base -> origin/gh/jansel/553/base 2025-12-04T12:46:35.4965255Z * [new branch] gh/jansel/553/head -> origin/gh/jansel/553/head 2025-12-04T12:46:35.4965323Z * [new branch] gh/jansel/553/orig -> origin/gh/jansel/553/orig 2025-12-04T12:46:35.4965390Z * [new branch] gh/jansel/554/base -> origin/gh/jansel/554/base 2025-12-04T12:46:35.4965458Z * [new branch] gh/jansel/554/head -> origin/gh/jansel/554/head 2025-12-04T12:46:35.4965528Z * [new branch] gh/jansel/554/orig -> origin/gh/jansel/554/orig 2025-12-04T12:46:35.4965595Z * [new branch] gh/jansel/555/base -> origin/gh/jansel/555/base 2025-12-04T12:46:35.4965662Z * [new branch] gh/jansel/555/head -> origin/gh/jansel/555/head 2025-12-04T12:46:35.4965731Z * [new branch] gh/jansel/555/orig -> origin/gh/jansel/555/orig 2025-12-04T12:46:35.4965797Z * [new branch] gh/jansel/556/base -> origin/gh/jansel/556/base 2025-12-04T12:46:35.4965865Z * [new branch] gh/jansel/556/head -> origin/gh/jansel/556/head 2025-12-04T12:46:35.4965932Z * [new branch] gh/jansel/556/orig -> origin/gh/jansel/556/orig 2025-12-04T12:46:35.4965998Z * [new branch] gh/jansel/557/base -> origin/gh/jansel/557/base 2025-12-04T12:46:35.4966066Z * [new branch] gh/jansel/557/head -> origin/gh/jansel/557/head 2025-12-04T12:46:35.4966164Z * [new branch] gh/jansel/557/orig -> origin/gh/jansel/557/orig 2025-12-04T12:46:35.4966231Z * [new branch] gh/jansel/558/base -> origin/gh/jansel/558/base 2025-12-04T12:46:35.4966299Z * [new branch] gh/jansel/558/head -> origin/gh/jansel/558/head 2025-12-04T12:46:35.4966390Z * [new branch] gh/jansel/558/orig -> origin/gh/jansel/558/orig 2025-12-04T12:46:35.4966457Z * [new branch] gh/jansel/559/base -> origin/gh/jansel/559/base 
2025-12-04T12:46:35.4966525Z * [new branch] gh/jansel/559/head -> origin/gh/jansel/559/head 2025-12-04T12:46:35.4966592Z * [new branch] gh/jansel/559/orig -> origin/gh/jansel/559/orig 2025-12-04T12:46:35.4966659Z * [new branch] gh/jansel/560/base -> origin/gh/jansel/560/base 2025-12-04T12:46:35.4966727Z * [new branch] gh/jansel/560/head -> origin/gh/jansel/560/head 2025-12-04T12:46:35.4966795Z * [new branch] gh/jansel/560/orig -> origin/gh/jansel/560/orig 2025-12-04T12:46:35.4966862Z * [new branch] gh/jansel/561/base -> origin/gh/jansel/561/base 2025-12-04T12:46:35.4966929Z * [new branch] gh/jansel/561/head -> origin/gh/jansel/561/head 2025-12-04T12:46:35.4966998Z * [new branch] gh/jansel/561/orig -> origin/gh/jansel/561/orig 2025-12-04T12:46:35.4967065Z * [new branch] gh/jansel/562/base -> origin/gh/jansel/562/base 2025-12-04T12:46:35.4967131Z * [new branch] gh/jansel/562/head -> origin/gh/jansel/562/head 2025-12-04T12:46:35.4967197Z * [new branch] gh/jansel/562/orig -> origin/gh/jansel/562/orig 2025-12-04T12:46:35.4967264Z * [new branch] gh/jansel/563/base -> origin/gh/jansel/563/base 2025-12-04T12:46:35.4967331Z * [new branch] gh/jansel/563/head -> origin/gh/jansel/563/head 2025-12-04T12:46:35.4967399Z * [new branch] gh/jansel/563/orig -> origin/gh/jansel/563/orig 2025-12-04T12:46:35.4967466Z * [new branch] gh/jansel/564/base -> origin/gh/jansel/564/base 2025-12-04T12:46:35.4967572Z * [new branch] gh/jansel/564/head -> origin/gh/jansel/564/head 2025-12-04T12:46:35.4967643Z * [new branch] gh/jansel/564/orig -> origin/gh/jansel/564/orig 2025-12-04T12:46:35.4967711Z * [new branch] gh/jansel/565/base -> origin/gh/jansel/565/base 2025-12-04T12:46:35.4967778Z * [new branch] gh/jansel/565/head -> origin/gh/jansel/565/head 2025-12-04T12:46:35.4967845Z * [new branch] gh/jansel/565/orig -> origin/gh/jansel/565/orig 2025-12-04T12:46:35.4967913Z * [new branch] gh/jansel/566/base -> origin/gh/jansel/566/base 2025-12-04T12:46:35.4967980Z * [new branch] gh/jansel/566/head -> 
origin/gh/jansel/566/head 2025-12-04T12:46:35.4968047Z * [new branch] gh/jansel/566/orig -> origin/gh/jansel/566/orig 2025-12-04T12:46:35.4968115Z * [new branch] gh/jansel/567/base -> origin/gh/jansel/567/base 2025-12-04T12:46:35.4968182Z * [new branch] gh/jansel/567/head -> origin/gh/jansel/567/head 2025-12-04T12:46:35.4968250Z * [new branch] gh/jansel/567/orig -> origin/gh/jansel/567/orig 2025-12-04T12:46:35.4968317Z * [new branch] gh/jansel/568/base -> origin/gh/jansel/568/base 2025-12-04T12:46:35.4968384Z * [new branch] gh/jansel/568/head -> origin/gh/jansel/568/head 2025-12-04T12:46:35.4968451Z * [new branch] gh/jansel/568/orig -> origin/gh/jansel/568/orig 2025-12-04T12:46:35.4968520Z * [new branch] gh/jansel/569/base -> origin/gh/jansel/569/base 2025-12-04T12:46:35.4968589Z * [new branch] gh/jansel/569/head -> origin/gh/jansel/569/head 2025-12-04T12:46:35.4968693Z * [new branch] gh/jansel/569/orig -> origin/gh/jansel/569/orig 2025-12-04T12:46:35.4968764Z * [new branch] gh/jansel/570/base -> origin/gh/jansel/570/base 2025-12-04T12:46:35.4968831Z * [new branch] gh/jansel/570/head -> origin/gh/jansel/570/head 2025-12-04T12:46:35.4968966Z * [new branch] gh/jansel/570/orig -> origin/gh/jansel/570/orig 2025-12-04T12:46:35.4969037Z * [new branch] gh/jansel/571/base -> origin/gh/jansel/571/base 2025-12-04T12:46:35.4969103Z * [new branch] gh/jansel/571/head -> origin/gh/jansel/571/head 2025-12-04T12:46:35.4969172Z * [new branch] gh/jansel/571/orig -> origin/gh/jansel/571/orig 2025-12-04T12:46:35.4969239Z * [new branch] gh/jansel/572/base -> origin/gh/jansel/572/base 2025-12-04T12:46:35.4969306Z * [new branch] gh/jansel/572/head -> origin/gh/jansel/572/head 2025-12-04T12:46:35.4969376Z * [new branch] gh/jansel/572/orig -> origin/gh/jansel/572/orig 2025-12-04T12:46:35.4969444Z * [new branch] gh/jansel/573/base -> origin/gh/jansel/573/base 2025-12-04T12:46:35.4969511Z * [new branch] gh/jansel/573/head -> origin/gh/jansel/573/head 2025-12-04T12:46:35.4969580Z * [new 
branch] gh/jansel/573/orig -> origin/gh/jansel/573/orig 2025-12-04T12:46:35.4969647Z * [new branch] gh/jansel/574/base -> origin/gh/jansel/574/base 2025-12-04T12:46:35.4969716Z * [new branch] gh/jansel/574/head -> origin/gh/jansel/574/head 2025-12-04T12:46:35.4969783Z * [new branch] gh/jansel/574/orig -> origin/gh/jansel/574/orig 2025-12-04T12:46:35.4969850Z * [new branch] gh/jansel/575/base -> origin/gh/jansel/575/base 2025-12-04T12:46:35.4969918Z * [new branch] gh/jansel/575/head -> origin/gh/jansel/575/head 2025-12-04T12:46:35.4969989Z * [new branch] gh/jansel/575/orig -> origin/gh/jansel/575/orig 2025-12-04T12:46:35.4970058Z * [new branch] gh/jansel/576/base -> origin/gh/jansel/576/base 2025-12-04T12:46:35.4970125Z * [new branch] gh/jansel/576/head -> origin/gh/jansel/576/head 2025-12-04T12:46:35.4970194Z * [new branch] gh/jansel/576/orig -> origin/gh/jansel/576/orig 2025-12-04T12:46:35.4970275Z * [new branch] gh/jbschlosser/247/base -> origin/gh/jbschlosser/247/base 2025-12-04T12:46:35.4970353Z * [new branch] gh/jbschlosser/247/head -> origin/gh/jbschlosser/247/head 2025-12-04T12:46:35.4970430Z * [new branch] gh/jbschlosser/247/orig -> origin/gh/jbschlosser/247/orig 2025-12-04T12:46:35.4970506Z * [new branch] gh/jbschlosser/250/base -> origin/gh/jbschlosser/250/base 2025-12-04T12:46:35.4970582Z * [new branch] gh/jbschlosser/250/head -> origin/gh/jbschlosser/250/head 2025-12-04T12:46:35.4970658Z * [new branch] gh/jbschlosser/250/orig -> origin/gh/jbschlosser/250/orig 2025-12-04T12:46:35.4970730Z * [new branch] gh/jerryzh168/1/base -> origin/gh/jerryzh168/1/base 2025-12-04T12:46:35.4970803Z * [new branch] gh/jerryzh168/1/head -> origin/gh/jerryzh168/1/head 2025-12-04T12:46:35.4970874Z * [new branch] gh/jerryzh168/1/orig -> origin/gh/jerryzh168/1/orig 2025-12-04T12:46:35.4970946Z * [new branch] gh/jiayisunx/59/base -> origin/gh/jiayisunx/59/base 2025-12-04T12:46:35.4971016Z * [new branch] gh/jiayisunx/59/head -> origin/gh/jiayisunx/59/head 
2025-12-04T12:46:35.4971087Z * [new branch] gh/jiayisunx/59/orig -> origin/gh/jiayisunx/59/orig 2025-12-04T12:46:35.4971157Z * [new branch] gh/jiayisunx/61/base -> origin/gh/jiayisunx/61/base 2025-12-04T12:46:35.4971228Z * [new branch] gh/jiayisunx/61/head -> origin/gh/jiayisunx/61/head 2025-12-04T12:46:35.4971321Z * [new branch] gh/jiayisunx/61/orig -> origin/gh/jiayisunx/61/orig 2025-12-04T12:46:35.4971392Z * [new branch] gh/jiayisunx/68/base -> origin/gh/jiayisunx/68/base 2025-12-04T12:46:35.4971462Z * [new branch] gh/jiayisunx/68/head -> origin/gh/jiayisunx/68/head 2025-12-04T12:46:35.4971556Z * [new branch] gh/jiayisunx/68/orig -> origin/gh/jiayisunx/68/orig 2025-12-04T12:46:35.4971626Z * [new branch] gh/jiayisunx/77/base -> origin/gh/jiayisunx/77/base 2025-12-04T12:46:35.4971697Z * [new branch] gh/jiayisunx/77/head -> origin/gh/jiayisunx/77/head 2025-12-04T12:46:35.4971767Z * [new branch] gh/jiayisunx/77/orig -> origin/gh/jiayisunx/77/orig 2025-12-04T12:46:35.4971837Z * [new branch] gh/jiayisunx/78/base -> origin/gh/jiayisunx/78/base 2025-12-04T12:46:35.4971910Z * [new branch] gh/jiayisunx/78/head -> origin/gh/jiayisunx/78/head 2025-12-04T12:46:35.4971981Z * [new branch] gh/jiayisunx/78/orig -> origin/gh/jiayisunx/78/orig 2025-12-04T12:46:35.4972051Z * [new branch] gh/jiayisunx/79/base -> origin/gh/jiayisunx/79/base 2025-12-04T12:46:35.4972122Z * [new branch] gh/jiayisunx/79/head -> origin/gh/jiayisunx/79/head 2025-12-04T12:46:35.4972193Z * [new branch] gh/jiayisunx/79/orig -> origin/gh/jiayisunx/79/orig 2025-12-04T12:46:35.4972265Z * [new branch] gh/jiayisunx/82/base -> origin/gh/jiayisunx/82/base 2025-12-04T12:46:35.4972334Z * [new branch] gh/jiayisunx/82/head -> origin/gh/jiayisunx/82/head 2025-12-04T12:46:35.4972405Z * [new branch] gh/jiayisunx/82/orig -> origin/gh/jiayisunx/82/orig 2025-12-04T12:46:35.4972475Z * [new branch] gh/jiayisunx/83/base -> origin/gh/jiayisunx/83/base 2025-12-04T12:46:35.4972546Z * [new branch] gh/jiayisunx/83/head -> 
origin/gh/jiayisunx/83/head 2025-12-04T12:46:35.4972616Z * [new branch] gh/jiayisunx/83/orig -> origin/gh/jiayisunx/83/orig 2025-12-04T12:46:35.4972687Z * [new branch] gh/jiayisunx/84/base -> origin/gh/jiayisunx/84/base 2025-12-04T12:46:35.4972756Z * [new branch] gh/jiayisunx/84/head -> origin/gh/jiayisunx/84/head 2025-12-04T12:46:35.4972828Z * [new branch] gh/jiayisunx/84/orig -> origin/gh/jiayisunx/84/orig 2025-12-04T12:46:35.4972899Z * [new branch] gh/jiayisunx/85/base -> origin/gh/jiayisunx/85/base 2025-12-04T12:46:35.4972970Z * [new branch] gh/jiayisunx/85/head -> origin/gh/jiayisunx/85/head 2025-12-04T12:46:35.4973039Z * [new branch] gh/jiayisunx/85/orig -> origin/gh/jiayisunx/85/orig 2025-12-04T12:46:35.4973111Z * [new branch] gh/jiayisunx/86/base -> origin/gh/jiayisunx/86/base 2025-12-04T12:46:35.4973182Z * [new branch] gh/jiayisunx/86/head -> origin/gh/jiayisunx/86/head 2025-12-04T12:46:35.4973252Z * [new branch] gh/jiayisunx/86/orig -> origin/gh/jiayisunx/86/orig 2025-12-04T12:46:35.4973324Z * [new branch] gh/jiayisunx/87/base -> origin/gh/jiayisunx/87/base 2025-12-04T12:46:35.4973394Z * [new branch] gh/jiayisunx/87/head -> origin/gh/jiayisunx/87/head 2025-12-04T12:46:35.4973466Z * [new branch] gh/jiayisunx/87/orig -> origin/gh/jiayisunx/87/orig 2025-12-04T12:46:35.4973538Z * [new branch] gh/jiayisunx/88/base -> origin/gh/jiayisunx/88/base 2025-12-04T12:46:35.4973608Z * [new branch] gh/jiayisunx/88/head -> origin/gh/jiayisunx/88/head 2025-12-04T12:46:35.4973678Z * [new branch] gh/jiayisunx/88/orig -> origin/gh/jiayisunx/88/orig 2025-12-04T12:46:35.4973750Z * [new branch] gh/jiayisunx/89/base -> origin/gh/jiayisunx/89/base 2025-12-04T12:46:35.4973844Z * [new branch] gh/jiayisunx/89/head -> origin/gh/jiayisunx/89/head 2025-12-04T12:46:35.4973916Z * [new branch] gh/jiayisunx/89/orig -> origin/gh/jiayisunx/89/orig 2025-12-04T12:46:35.4973986Z * [new branch] gh/jiayisunx/90/base -> origin/gh/jiayisunx/90/base 2025-12-04T12:46:35.4974056Z * [new branch] 
gh/jiayisunx/90/head -> origin/gh/jiayisunx/90/head 2025-12-04T12:46:35.4974156Z * [new branch] gh/jiayisunx/90/orig -> origin/gh/jiayisunx/90/orig 2025-12-04T12:46:35.4974233Z * [new branch] gh/jjwu@meta.com/1/base -> origin/gh/jjwu@meta.com/1/base 2025-12-04T12:46:35.4974308Z * [new branch] gh/jjwu@meta.com/1/head -> origin/gh/jjwu@meta.com/1/head 2025-12-04T12:46:35.4974378Z * [new branch] gh/jturney/1/base -> origin/gh/jturney/1/base 2025-12-04T12:46:35.4974446Z * [new branch] gh/jturney/1/head -> origin/gh/jturney/1/head 2025-12-04T12:46:35.4974515Z * [new branch] gh/jturney/1/orig -> origin/gh/jturney/1/orig 2025-12-04T12:46:35.4974583Z * [new branch] gh/jturney/2/base -> origin/gh/jturney/2/base 2025-12-04T12:46:35.4974650Z * [new branch] gh/jturney/2/head -> origin/gh/jturney/2/head 2025-12-04T12:46:35.4974717Z * [new branch] gh/jturney/2/orig -> origin/gh/jturney/2/orig 2025-12-04T12:46:35.4974797Z * [new branch] gh/karthickai/10/base -> origin/gh/karthickai/10/base 2025-12-04T12:46:35.4974872Z * [new branch] gh/karthickai/10/head -> origin/gh/karthickai/10/head 2025-12-04T12:46:35.4974945Z * [new branch] gh/karthickai/10/orig -> origin/gh/karthickai/10/orig 2025-12-04T12:46:35.4975018Z * [new branch] gh/karthickai/11/base -> origin/gh/karthickai/11/base 2025-12-04T12:46:35.4975090Z * [new branch] gh/karthickai/11/head -> origin/gh/karthickai/11/head 2025-12-04T12:46:35.4975163Z * [new branch] gh/karthickai/11/orig -> origin/gh/karthickai/11/orig 2025-12-04T12:46:35.4975237Z * [new branch] gh/karthickai/12/base -> origin/gh/karthickai/12/base 2025-12-04T12:46:35.4975309Z * [new branch] gh/karthickai/12/head -> origin/gh/karthickai/12/head 2025-12-04T12:46:35.4975381Z * [new branch] gh/karthickai/12/orig -> origin/gh/karthickai/12/orig 2025-12-04T12:46:35.4975455Z * [new branch] gh/karthickai/13/base -> origin/gh/karthickai/13/base 2025-12-04T12:46:35.4975527Z * [new branch] gh/karthickai/13/head -> origin/gh/karthickai/13/head 2025-12-04T12:46:35.4975600Z 
* [new branch] gh/karthickai/13/orig -> origin/gh/karthickai/13/orig 2025-12-04T12:46:35.4975672Z * [new branch] gh/karthickai/14/base -> origin/gh/karthickai/14/base 2025-12-04T12:46:35.4975744Z * [new branch] gh/karthickai/14/head -> origin/gh/karthickai/14/head 2025-12-04T12:46:35.4975818Z * [new branch] gh/karthickai/14/orig -> origin/gh/karthickai/14/orig 2025-12-04T12:46:35.4975890Z * [new branch] gh/karthickai/15/base -> origin/gh/karthickai/15/base 2025-12-04T12:46:35.4975961Z * [new branch] gh/karthickai/15/head -> origin/gh/karthickai/15/head 2025-12-04T12:46:35.4976036Z * [new branch] gh/karthickai/15/orig -> origin/gh/karthickai/15/orig 2025-12-04T12:46:35.4976107Z * [new branch] gh/karthickai/16/base -> origin/gh/karthickai/16/base 2025-12-04T12:46:35.4976179Z * [new branch] gh/karthickai/16/head -> origin/gh/karthickai/16/head 2025-12-04T12:46:35.4976252Z * [new branch] gh/karthickai/16/orig -> origin/gh/karthickai/16/orig 2025-12-04T12:46:35.4976325Z * [new branch] gh/karthickai/17/base -> origin/gh/karthickai/17/base 2025-12-04T12:46:35.4976397Z * [new branch] gh/karthickai/17/head -> origin/gh/karthickai/17/head 2025-12-04T12:46:35.4976495Z * [new branch] gh/karthickai/17/orig -> origin/gh/karthickai/17/orig 2025-12-04T12:46:35.4976568Z * [new branch] gh/karthickai/18/base -> origin/gh/karthickai/18/base 2025-12-04T12:46:35.4976640Z * [new branch] gh/karthickai/18/head -> origin/gh/karthickai/18/head 2025-12-04T12:46:35.4976735Z * [new branch] gh/karthickai/18/orig -> origin/gh/karthickai/18/orig 2025-12-04T12:46:35.4976807Z * [new branch] gh/karthickai/19/base -> origin/gh/karthickai/19/base 2025-12-04T12:46:35.4976879Z * [new branch] gh/karthickai/19/head -> origin/gh/karthickai/19/head 2025-12-04T12:46:35.4976969Z * [new branch] gh/karthickai/19/orig -> origin/gh/karthickai/19/orig 2025-12-04T12:46:35.4977041Z * [new branch] gh/karthickai/20/base -> origin/gh/karthickai/20/base 2025-12-04T12:46:35.4977113Z * [new branch] gh/karthickai/20/head -> 
origin/gh/karthickai/20/head 2025-12-04T12:46:35.4977186Z * [new branch] gh/karthickai/20/orig -> origin/gh/karthickai/20/orig 2025-12-04T12:46:35.4977258Z * [new branch] gh/karthickai/21/base -> origin/gh/karthickai/21/base 2025-12-04T12:46:35.4977332Z * [new branch] gh/karthickai/21/head -> origin/gh/karthickai/21/head 2025-12-04T12:46:35.4977405Z * [new branch] gh/karthickai/21/orig -> origin/gh/karthickai/21/orig 2025-12-04T12:46:35.4977532Z * [new branch] gh/karthickai/22/base -> origin/gh/karthickai/22/base 2025-12-04T12:46:35.4977606Z * [new branch] gh/karthickai/22/head -> origin/gh/karthickai/22/head 2025-12-04T12:46:35.4977678Z * [new branch] gh/karthickai/22/orig -> origin/gh/karthickai/22/orig 2025-12-04T12:46:35.4977750Z * [new branch] gh/karthickai/23/base -> origin/gh/karthickai/23/base 2025-12-04T12:46:35.4977826Z * [new branch] gh/karthickai/23/head -> origin/gh/karthickai/23/head 2025-12-04T12:46:35.4977898Z * [new branch] gh/karthickai/23/orig -> origin/gh/karthickai/23/orig 2025-12-04T12:46:35.4977971Z * [new branch] gh/karthickai/24/base -> origin/gh/karthickai/24/base 2025-12-04T12:46:35.4978044Z * [new branch] gh/karthickai/24/head -> origin/gh/karthickai/24/head 2025-12-04T12:46:35.4978119Z * [new branch] gh/karthickai/24/orig -> origin/gh/karthickai/24/orig 2025-12-04T12:46:35.4978191Z * [new branch] gh/karthickai/25/base -> origin/gh/karthickai/25/base 2025-12-04T12:46:35.4978264Z * [new branch] gh/karthickai/25/head -> origin/gh/karthickai/25/head 2025-12-04T12:46:35.4978336Z * [new branch] gh/karthickai/25/orig -> origin/gh/karthickai/25/orig 2025-12-04T12:46:35.4978407Z * [new branch] gh/karthickai/26/base -> origin/gh/karthickai/26/base 2025-12-04T12:46:35.4978483Z * [new branch] gh/karthickai/26/head -> origin/gh/karthickai/26/head 2025-12-04T12:46:35.4978555Z * [new branch] gh/karthickai/26/orig -> origin/gh/karthickai/26/orig 2025-12-04T12:46:35.4978628Z * [new branch] gh/karthickai/6/base -> origin/gh/karthickai/6/base 
2025-12-04T12:46:35.4978701Z * [new branch] gh/karthickai/6/head -> origin/gh/karthickai/6/head 2025-12-04T12:46:35.4978773Z * [new branch] gh/karthickai/6/orig -> origin/gh/karthickai/6/orig 2025-12-04T12:46:35.4978841Z * [new branch] gh/krocki/1/base -> origin/gh/krocki/1/base 2025-12-04T12:46:35.4978909Z * [new branch] gh/krocki/1/head -> origin/gh/krocki/1/head 2025-12-04T12:46:35.4978975Z * [new branch] gh/krocki/1/orig -> origin/gh/krocki/1/orig 2025-12-04T12:46:35.4979042Z * [new branch] gh/krocki/2/base -> origin/gh/krocki/2/base 2025-12-04T12:46:35.4979160Z * [new branch] gh/krocki/2/head -> origin/gh/krocki/2/head 2025-12-04T12:46:35.4979226Z * [new branch] gh/krocki/2/orig -> origin/gh/krocki/2/orig 2025-12-04T12:46:35.4979305Z * [new branch] gh/kurtamohler/60/base -> origin/gh/kurtamohler/60/base 2025-12-04T12:46:35.4979382Z * [new branch] gh/kurtamohler/60/head -> origin/gh/kurtamohler/60/head 2025-12-04T12:46:35.4979495Z * [new branch] gh/kurtamohler/60/orig -> origin/gh/kurtamohler/60/orig 2025-12-04T12:46:35.4979571Z * [new branch] gh/kurtamohler/61/base -> origin/gh/kurtamohler/61/base 2025-12-04T12:46:35.4979645Z * [new branch] gh/kurtamohler/61/head -> origin/gh/kurtamohler/61/head 2025-12-04T12:46:35.4979719Z * [new branch] gh/kurtamohler/61/orig -> origin/gh/kurtamohler/61/orig 2025-12-04T12:46:35.4979793Z * [new branch] gh/kurtamohler/62/base -> origin/gh/kurtamohler/62/base 2025-12-04T12:46:35.4979868Z * [new branch] gh/kurtamohler/62/head -> origin/gh/kurtamohler/62/head 2025-12-04T12:46:35.4979941Z * [new branch] gh/kurtamohler/62/orig -> origin/gh/kurtamohler/62/orig 2025-12-04T12:46:35.4980015Z * [new branch] gh/kurtamohler/63/base -> origin/gh/kurtamohler/63/base 2025-12-04T12:46:35.4980089Z * [new branch] gh/kurtamohler/63/head -> origin/gh/kurtamohler/63/head 2025-12-04T12:46:35.4980165Z * [new branch] gh/kurtamohler/63/orig -> origin/gh/kurtamohler/63/orig 2025-12-04T12:46:35.4980239Z * [new branch] gh/kurtamohler/64/base -> 
origin/gh/kurtamohler/64/base 2025-12-04T12:46:35.4980312Z * [new branch] gh/kurtamohler/64/head -> origin/gh/kurtamohler/64/head 2025-12-04T12:46:35.4980387Z * [new branch] gh/kurtamohler/64/orig -> origin/gh/kurtamohler/64/orig 2025-12-04T12:46:35.4980460Z * [new branch] gh/kurtamohler/65/base -> origin/gh/kurtamohler/65/base 2025-12-04T12:46:35.4980534Z * [new branch] gh/kurtamohler/65/head -> origin/gh/kurtamohler/65/head 2025-12-04T12:46:35.4980608Z * [new branch] gh/kurtamohler/65/orig -> origin/gh/kurtamohler/65/orig 2025-12-04T12:46:35.4980681Z * [new branch] gh/kurtamohler/66/base -> origin/gh/kurtamohler/66/base 2025-12-04T12:46:35.4980756Z * [new branch] gh/kurtamohler/66/head -> origin/gh/kurtamohler/66/head 2025-12-04T12:46:35.4980831Z * [new branch] gh/kurtamohler/66/orig -> origin/gh/kurtamohler/66/orig 2025-12-04T12:46:35.4980905Z * [new branch] gh/kurtamohler/67/base -> origin/gh/kurtamohler/67/base 2025-12-04T12:46:35.4980978Z * [new branch] gh/kurtamohler/67/head -> origin/gh/kurtamohler/67/head 2025-12-04T12:46:35.4981052Z * [new branch] gh/kurtamohler/67/orig -> origin/gh/kurtamohler/67/orig 2025-12-04T12:46:35.4981122Z * [new branch] gh/kwen2501/130/base -> origin/gh/kwen2501/130/base 2025-12-04T12:46:35.4981194Z * [new branch] gh/kwen2501/130/head -> origin/gh/kwen2501/130/head 2025-12-04T12:46:35.4981264Z * [new branch] gh/kwen2501/130/orig -> origin/gh/kwen2501/130/orig 2025-12-04T12:46:35.4981333Z * [new branch] gh/kwen2501/170/base -> origin/gh/kwen2501/170/base 2025-12-04T12:46:35.4981403Z * [new branch] gh/kwen2501/170/head -> origin/gh/kwen2501/170/head 2025-12-04T12:46:35.4981473Z * [new branch] gh/kwen2501/187/base -> origin/gh/kwen2501/187/base 2025-12-04T12:46:35.4981541Z * [new branch] gh/kwen2501/187/head -> origin/gh/kwen2501/187/head 2025-12-04T12:46:35.4981609Z * [new branch] gh/kwen2501/187/orig -> origin/gh/kwen2501/187/orig 2025-12-04T12:46:35.4981678Z * [new branch] gh/kwen2501/188/base -> origin/gh/kwen2501/188/base 
2025-12-04T12:46:35.4981747Z * [new branch] gh/kwen2501/188/head -> origin/gh/kwen2501/188/head 2025-12-04T12:46:35.4981838Z * [new branch] gh/kwen2501/188/orig -> origin/gh/kwen2501/188/orig 2025-12-04T12:46:35.4981907Z * [new branch] gh/kwen2501/211/base -> origin/gh/kwen2501/211/base 2025-12-04T12:46:35.4981975Z * [new branch] gh/kwen2501/211/head -> origin/gh/kwen2501/211/head 2025-12-04T12:46:35.4982075Z * [new branch] gh/kwen2501/224/base -> origin/gh/kwen2501/224/base 2025-12-04T12:46:35.4982144Z * [new branch] gh/kwen2501/224/head -> origin/gh/kwen2501/224/head 2025-12-04T12:46:35.4982212Z * [new branch] gh/kwen2501/224/orig -> origin/gh/kwen2501/224/orig 2025-12-04T12:46:35.4982282Z * [new branch] gh/kwen2501/228/base -> origin/gh/kwen2501/228/base 2025-12-04T12:46:35.4982350Z * [new branch] gh/kwen2501/228/head -> origin/gh/kwen2501/228/head 2025-12-04T12:46:35.4982419Z * [new branch] gh/kwen2501/228/orig -> origin/gh/kwen2501/228/orig 2025-12-04T12:46:35.4982489Z * [new branch] gh/kwen2501/234/base -> origin/gh/kwen2501/234/base 2025-12-04T12:46:35.4982557Z * [new branch] gh/kwen2501/234/head -> origin/gh/kwen2501/234/head 2025-12-04T12:46:35.4982624Z * [new branch] gh/kwen2501/234/orig -> origin/gh/kwen2501/234/orig 2025-12-04T12:46:35.4982694Z * [new branch] gh/kwen2501/235/base -> origin/gh/kwen2501/235/base 2025-12-04T12:46:35.4982762Z * [new branch] gh/kwen2501/235/head -> origin/gh/kwen2501/235/head 2025-12-04T12:46:35.4982830Z * [new branch] gh/kwen2501/235/orig -> origin/gh/kwen2501/235/orig 2025-12-04T12:46:35.4982898Z * [new branch] gh/kwen2501/236/base -> origin/gh/kwen2501/236/base 2025-12-04T12:46:35.4982966Z * [new branch] gh/kwen2501/236/head -> origin/gh/kwen2501/236/head 2025-12-04T12:46:35.4983036Z * [new branch] gh/kwen2501/236/orig -> origin/gh/kwen2501/236/orig 2025-12-04T12:46:35.4983105Z * [new branch] gh/kwen2501/237/base -> origin/gh/kwen2501/237/base 2025-12-04T12:46:35.4983173Z * [new branch] gh/kwen2501/237/head -> 
origin/gh/kwen2501/237/head 2025-12-04T12:46:35.4983242Z * [new branch] gh/kwen2501/237/orig -> origin/gh/kwen2501/237/orig 2025-12-04T12:46:35.4983312Z * [new branch] gh/kwen2501/238/base -> origin/gh/kwen2501/238/base 2025-12-04T12:46:35.4983380Z * [new branch] gh/kwen2501/238/head -> origin/gh/kwen2501/238/head 2025-12-04T12:46:35.4983451Z * [new branch] gh/kwen2501/238/orig -> origin/gh/kwen2501/238/orig 2025-12-04T12:46:35.4983520Z * [new branch] gh/kwen2501/240/base -> origin/gh/kwen2501/240/base 2025-12-04T12:46:35.4983588Z * [new branch] gh/kwen2501/240/head -> origin/gh/kwen2501/240/head 2025-12-04T12:46:35.4983658Z * [new branch] gh/kwen2501/240/orig -> origin/gh/kwen2501/240/orig 2025-12-04T12:46:35.4983726Z * [new branch] gh/kwen2501/241/base -> origin/gh/kwen2501/241/base 2025-12-04T12:46:35.4983794Z * [new branch] gh/kwen2501/241/head -> origin/gh/kwen2501/241/head 2025-12-04T12:46:35.4983863Z * [new branch] gh/kwen2501/241/orig -> origin/gh/kwen2501/241/orig 2025-12-04T12:46:35.4983933Z * [new branch] gh/kwen2501/247/base -> origin/gh/kwen2501/247/base 2025-12-04T12:46:35.4984001Z * [new branch] gh/kwen2501/247/head -> origin/gh/kwen2501/247/head 2025-12-04T12:46:35.4984070Z * [new branch] gh/kwen2501/247/orig -> origin/gh/kwen2501/247/orig 2025-12-04T12:46:35.4984139Z * [new branch] gh/kwen2501/252/base -> origin/gh/kwen2501/252/base 2025-12-04T12:46:35.4984207Z * [new branch] gh/kwen2501/252/head -> origin/gh/kwen2501/252/head 2025-12-04T12:46:35.4984300Z * [new branch] gh/kwen2501/252/orig -> origin/gh/kwen2501/252/orig 2025-12-04T12:46:35.4984369Z * [new branch] gh/kwen2501/259/base -> origin/gh/kwen2501/259/base 2025-12-04T12:46:35.4984437Z * [new branch] gh/kwen2501/259/head -> origin/gh/kwen2501/259/head 2025-12-04T12:46:35.4984506Z * [new branch] gh/kwen2501/259/orig -> origin/gh/kwen2501/259/orig 2025-12-04T12:46:35.4984597Z * [new branch] gh/kwen2501/260/base -> origin/gh/kwen2501/260/base 2025-12-04T12:46:35.4984666Z * [new branch] 
gh/kwen2501/260/head -> origin/gh/kwen2501/260/head 2025-12-04T12:46:35.4984735Z * [new branch] gh/kwen2501/260/orig -> origin/gh/kwen2501/260/orig 2025-12-04T12:46:35.4984803Z * [new branch] gh/kwen2501/268/base -> origin/gh/kwen2501/268/base 2025-12-04T12:46:35.4984874Z * [new branch] gh/kwen2501/268/head -> origin/gh/kwen2501/268/head 2025-12-04T12:46:35.4984945Z * [new branch] gh/kwen2501/268/orig -> origin/gh/kwen2501/268/orig 2025-12-04T12:46:35.4985012Z * [new branch] gh/kwen2501/269/base -> origin/gh/kwen2501/269/base 2025-12-04T12:46:35.4985081Z * [new branch] gh/kwen2501/269/head -> origin/gh/kwen2501/269/head 2025-12-04T12:46:35.4985148Z * [new branch] gh/kwen2501/269/orig -> origin/gh/kwen2501/269/orig 2025-12-04T12:46:35.4985218Z * [new branch] gh/kwen2501/270/base -> origin/gh/kwen2501/270/base 2025-12-04T12:46:35.4985288Z * [new branch] gh/kwen2501/270/head -> origin/gh/kwen2501/270/head 2025-12-04T12:46:35.4985356Z * [new branch] gh/kwen2501/270/orig -> origin/gh/kwen2501/270/orig 2025-12-04T12:46:35.4985423Z * [new branch] gh/kwen2501/271/base -> origin/gh/kwen2501/271/base 2025-12-04T12:46:35.4985493Z * [new branch] gh/kwen2501/271/head -> origin/gh/kwen2501/271/head 2025-12-04T12:46:35.4985562Z * [new branch] gh/kwen2501/271/orig -> origin/gh/kwen2501/271/orig 2025-12-04T12:46:35.4985630Z * [new branch] gh/kwen2501/274/base -> origin/gh/kwen2501/274/base 2025-12-04T12:46:35.4985700Z * [new branch] gh/kwen2501/274/head -> origin/gh/kwen2501/274/head 2025-12-04T12:46:35.4985770Z * [new branch] gh/kwen2501/274/orig -> origin/gh/kwen2501/274/orig 2025-12-04T12:46:35.4985838Z * [new branch] gh/kwen2501/275/base -> origin/gh/kwen2501/275/base 2025-12-04T12:46:35.4985907Z * [new branch] gh/kwen2501/275/head -> origin/gh/kwen2501/275/head 2025-12-04T12:46:35.4985975Z * [new branch] gh/kwen2501/275/orig -> origin/gh/kwen2501/275/orig 2025-12-04T12:46:35.4986043Z * [new branch] gh/kwen2501/276/base -> origin/gh/kwen2501/276/base 
2025-12-04T12:46:35.4986113Z * [new branch] gh/kwen2501/276/head -> origin/gh/kwen2501/276/head 2025-12-04T12:46:35.4986182Z * [new branch] gh/kwen2501/276/orig -> origin/gh/kwen2501/276/orig 2025-12-04T12:46:35.4986251Z * [new branch] gh/kwen2501/277/base -> origin/gh/kwen2501/277/base 2025-12-04T12:46:35.4986319Z * [new branch] gh/kwen2501/277/head -> origin/gh/kwen2501/277/head 2025-12-04T12:46:35.4986388Z * [new branch] gh/kwen2501/277/orig -> origin/gh/kwen2501/277/orig 2025-12-04T12:46:35.4986457Z * [new branch] gh/kwen2501/278/base -> origin/gh/kwen2501/278/base 2025-12-04T12:46:35.4986525Z * [new branch] gh/kwen2501/278/head -> origin/gh/kwen2501/278/head 2025-12-04T12:46:35.4986593Z * [new branch] gh/kwen2501/278/orig -> origin/gh/kwen2501/278/orig 2025-12-04T12:46:35.4986662Z * [new branch] gh/kwen2501/279/base -> origin/gh/kwen2501/279/base 2025-12-04T12:46:35.4986731Z * [new branch] gh/kwen2501/279/head -> origin/gh/kwen2501/279/head 2025-12-04T12:46:35.4986821Z * [new branch] gh/kwen2501/279/orig -> origin/gh/kwen2501/279/orig 2025-12-04T12:46:35.4986892Z * [new branch] gh/kwen2501/280/base -> origin/gh/kwen2501/280/base 2025-12-04T12:46:35.4986959Z * [new branch] gh/kwen2501/280/head -> origin/gh/kwen2501/280/head 2025-12-04T12:46:35.4987050Z * [new branch] gh/kwen2501/280/orig -> origin/gh/kwen2501/280/orig 2025-12-04T12:46:35.4987119Z * [new branch] gh/kwen2501/281/base -> origin/gh/kwen2501/281/base 2025-12-04T12:46:35.4987187Z * [new branch] gh/kwen2501/281/head -> origin/gh/kwen2501/281/head 2025-12-04T12:46:35.4987255Z * [new branch] gh/kwen2501/281/orig -> origin/gh/kwen2501/281/orig 2025-12-04T12:46:35.4987324Z * [new branch] gh/kwen2501/282/base -> origin/gh/kwen2501/282/base 2025-12-04T12:46:35.4987392Z * [new branch] gh/kwen2501/282/head -> origin/gh/kwen2501/282/head 2025-12-04T12:46:35.4987461Z * [new branch] gh/kwen2501/282/orig -> origin/gh/kwen2501/282/orig 2025-12-04T12:46:35.4987572Z * [new branch] gh/kwen2501/283/base -> 
origin/gh/kwen2501/283/base 2025-12-04T12:46:35.4987642Z * [new branch] gh/kwen2501/283/head -> origin/gh/kwen2501/283/head 2025-12-04T12:46:35.4987714Z * [new branch] gh/kwen2501/283/orig -> origin/gh/kwen2501/283/orig 2025-12-04T12:46:35.4987783Z * [new branch] gh/kwen2501/284/base -> origin/gh/kwen2501/284/base 2025-12-04T12:46:35.4987851Z * [new branch] gh/kwen2501/284/head -> origin/gh/kwen2501/284/head 2025-12-04T12:46:35.4987920Z * [new branch] gh/kwen2501/284/orig -> origin/gh/kwen2501/284/orig 2025-12-04T12:46:35.4987988Z * [new branch] gh/kwen2501/285/base -> origin/gh/kwen2501/285/base 2025-12-04T12:46:35.4988056Z * [new branch] gh/kwen2501/285/head -> origin/gh/kwen2501/285/head 2025-12-04T12:46:35.4988125Z * [new branch] gh/kwen2501/285/orig -> origin/gh/kwen2501/285/orig 2025-12-04T12:46:35.4988193Z * [new branch] gh/kwen2501/286/base -> origin/gh/kwen2501/286/base 2025-12-04T12:46:35.4988261Z * [new branch] gh/kwen2501/286/head -> origin/gh/kwen2501/286/head 2025-12-04T12:46:35.4988332Z * [new branch] gh/kwen2501/286/orig -> origin/gh/kwen2501/286/orig 2025-12-04T12:46:35.4988399Z * [new branch] gh/kwen2501/287/base -> origin/gh/kwen2501/287/base 2025-12-04T12:46:35.4988467Z * [new branch] gh/kwen2501/287/head -> origin/gh/kwen2501/287/head 2025-12-04T12:46:35.4988536Z * [new branch] gh/kwen2501/287/orig -> origin/gh/kwen2501/287/orig 2025-12-04T12:46:35.4988604Z * [new branch] gh/kwen2501/288/base -> origin/gh/kwen2501/288/base 2025-12-04T12:46:35.4988673Z * [new branch] gh/kwen2501/288/head -> origin/gh/kwen2501/288/head 2025-12-04T12:46:35.4988741Z * [new branch] gh/kwen2501/288/orig -> origin/gh/kwen2501/288/orig 2025-12-04T12:46:35.4988817Z * [new branch] gh/laithsakka/251/base -> origin/gh/laithsakka/251/base 2025-12-04T12:46:35.4988890Z * [new branch] gh/laithsakka/251/head -> origin/gh/laithsakka/251/head 2025-12-04T12:46:35.4988965Z * [new branch] gh/laithsakka/251/orig -> origin/gh/laithsakka/251/orig 2025-12-04T12:46:35.4989039Z * [new 
branch] gh/laithsakka/276/base -> origin/gh/laithsakka/276/base 2025-12-04T12:46:35.4989113Z * [new branch] gh/laithsakka/276/head -> origin/gh/laithsakka/276/head 2025-12-04T12:46:35.4989186Z * [new branch] gh/laithsakka/276/orig -> origin/gh/laithsakka/276/orig 2025-12-04T12:46:35.4989259Z * [new branch] gh/laithsakka/28/base -> origin/gh/laithsakka/28/base 2025-12-04T12:46:35.4989379Z * [new branch] gh/laithsakka/29/base -> origin/gh/laithsakka/29/base 2025-12-04T12:46:35.4989452Z * [new branch] gh/laithsakka/30/base -> origin/gh/laithsakka/30/base 2025-12-04T12:46:35.4989525Z * [new branch] gh/laithsakka/30/head -> origin/gh/laithsakka/30/head 2025-12-04T12:46:35.4989599Z * [new branch] gh/laithsakka/31/base -> origin/gh/laithsakka/31/base 2025-12-04T12:46:35.4989712Z * [new branch] gh/laithsakka/31/head -> origin/gh/laithsakka/31/head 2025-12-04T12:46:35.4989786Z * [new branch] gh/laithsakka/313/base -> origin/gh/laithsakka/313/base 2025-12-04T12:46:35.4989861Z * [new branch] gh/laithsakka/313/head -> origin/gh/laithsakka/313/head 2025-12-04T12:46:35.4989933Z * [new branch] gh/laithsakka/313/orig -> origin/gh/laithsakka/313/orig 2025-12-04T12:46:35.4990006Z * [new branch] gh/laithsakka/316/base -> origin/gh/laithsakka/316/base 2025-12-04T12:46:35.4990081Z * [new branch] gh/laithsakka/316/head -> origin/gh/laithsakka/316/head 2025-12-04T12:46:35.4990154Z * [new branch] gh/laithsakka/316/orig -> origin/gh/laithsakka/316/orig 2025-12-04T12:46:35.4990226Z * [new branch] gh/laithsakka/317/base -> origin/gh/laithsakka/317/base 2025-12-04T12:46:35.4990300Z * [new branch] gh/laithsakka/317/head -> origin/gh/laithsakka/317/head 2025-12-04T12:46:35.4990375Z * [new branch] gh/laithsakka/317/orig -> origin/gh/laithsakka/317/orig 2025-12-04T12:46:35.4990447Z * [new branch] gh/laithsakka/319/base -> origin/gh/laithsakka/319/base 2025-12-04T12:46:35.4990521Z * [new branch] gh/laithsakka/319/head -> origin/gh/laithsakka/319/head 2025-12-04T12:46:35.4990593Z * [new branch] 
gh/laithsakka/319/orig -> origin/gh/laithsakka/319/orig 2025-12-04T12:46:35.4990666Z * [new branch] gh/laithsakka/32/base -> origin/gh/laithsakka/32/base 2025-12-04T12:46:35.4990742Z * [new branch] gh/laithsakka/32/head -> origin/gh/laithsakka/32/head 2025-12-04T12:46:35.4990814Z * [new branch] gh/laithsakka/320/base -> origin/gh/laithsakka/320/base 2025-12-04T12:46:35.4990887Z * [new branch] gh/laithsakka/320/head -> origin/gh/laithsakka/320/head 2025-12-04T12:46:35.4990961Z * [new branch] gh/laithsakka/320/orig -> origin/gh/laithsakka/320/orig 2025-12-04T12:46:35.4991034Z * [new branch] gh/laithsakka/321/base -> origin/gh/laithsakka/321/base 2025-12-04T12:46:35.4991107Z * [new branch] gh/laithsakka/321/head -> origin/gh/laithsakka/321/head 2025-12-04T12:46:35.4991179Z * [new branch] gh/laithsakka/321/orig -> origin/gh/laithsakka/321/orig 2025-12-04T12:46:35.4991251Z * [new branch] gh/laithsakka/322/base -> origin/gh/laithsakka/322/base 2025-12-04T12:46:35.4991324Z * [new branch] gh/laithsakka/322/head -> origin/gh/laithsakka/322/head 2025-12-04T12:46:35.4991397Z * [new branch] gh/laithsakka/322/orig -> origin/gh/laithsakka/322/orig 2025-12-04T12:46:35.4991470Z * [new branch] gh/laithsakka/323/base -> origin/gh/laithsakka/323/base 2025-12-04T12:46:35.4991543Z * [new branch] gh/laithsakka/323/head -> origin/gh/laithsakka/323/head 2025-12-04T12:46:35.4991617Z * [new branch] gh/laithsakka/323/orig -> origin/gh/laithsakka/323/orig 2025-12-04T12:46:35.4991689Z * [new branch] gh/laithsakka/324/base -> origin/gh/laithsakka/324/base 2025-12-04T12:46:35.4991763Z * [new branch] gh/laithsakka/324/head -> origin/gh/laithsakka/324/head 2025-12-04T12:46:35.4991835Z * [new branch] gh/laithsakka/324/orig -> origin/gh/laithsakka/324/orig 2025-12-04T12:46:35.4991907Z * [new branch] gh/laithsakka/325/base -> origin/gh/laithsakka/325/base 2025-12-04T12:46:35.4992005Z * [new branch] gh/laithsakka/325/head -> origin/gh/laithsakka/325/head 2025-12-04T12:46:35.4992078Z * [new branch] 
gh/laithsakka/325/orig -> origin/gh/laithsakka/325/orig 2025-12-04T12:46:35.4992150Z * [new branch] gh/laithsakka/326/base -> origin/gh/laithsakka/326/base 2025-12-04T12:46:35.4992224Z * [new branch] gh/laithsakka/326/head -> origin/gh/laithsakka/326/head 2025-12-04T12:46:35.4992322Z * [new branch] gh/laithsakka/326/orig -> origin/gh/laithsakka/326/orig 2025-12-04T12:46:35.4992395Z * [new branch] gh/laithsakka/327/base -> origin/gh/laithsakka/327/base 2025-12-04T12:46:35.4992467Z * [new branch] gh/laithsakka/327/head -> origin/gh/laithsakka/327/head 2025-12-04T12:46:35.4992539Z * [new branch] gh/laithsakka/327/orig -> origin/gh/laithsakka/327/orig 2025-12-04T12:46:35.4992613Z * [new branch] gh/laithsakka/328/base -> origin/gh/laithsakka/328/base 2025-12-04T12:46:35.4992686Z * [new branch] gh/laithsakka/328/head -> origin/gh/laithsakka/328/head 2025-12-04T12:46:35.4992759Z * [new branch] gh/laithsakka/328/orig -> origin/gh/laithsakka/328/orig 2025-12-04T12:46:35.4992830Z * [new branch] gh/liangel/4/base -> origin/gh/liangel/4/base 2025-12-04T12:46:35.4992899Z * [new branch] gh/liangel/4/head -> origin/gh/liangel/4/head 2025-12-04T12:46:35.4992968Z * [new branch] gh/liangel/4/orig -> origin/gh/liangel/4/orig 2025-12-04T12:46:35.4993045Z * [new branch] gh/lucaskabela/1/base -> origin/gh/lucaskabela/1/base 2025-12-04T12:46:35.4993119Z * [new branch] gh/lucaskabela/1/head -> origin/gh/lucaskabela/1/head 2025-12-04T12:46:35.4993183Z * [new branch] gh/lw/4/base -> origin/gh/lw/4/base 2025-12-04T12:46:35.4993248Z * [new branch] gh/lw/4/head -> origin/gh/lw/4/head 2025-12-04T12:46:35.4993310Z * [new branch] gh/lw/4/orig -> origin/gh/lw/4/orig 2025-12-04T12:46:35.4993372Z * [new branch] gh/lw/5/base -> origin/gh/lw/5/base 2025-12-04T12:46:35.4993434Z * [new branch] gh/lw/5/head -> origin/gh/lw/5/head 2025-12-04T12:46:35.4993494Z * [new branch] gh/lw/5/orig -> origin/gh/lw/5/orig 2025-12-04T12:46:35.4993555Z * [new branch] gh/lw/6/base -> origin/gh/lw/6/base 
2025-12-04T12:46:35.4993616Z * [new branch] gh/lw/6/head -> origin/gh/lw/6/head 2025-12-04T12:46:35.4993676Z * [new branch] gh/lw/6/orig -> origin/gh/lw/6/orig 2025-12-04T12:46:35.4993743Z * [new branch] gh/malfet/14/base -> origin/gh/malfet/14/base 2025-12-04T12:46:35.4993814Z * [new branch] gh/malfet/417/base -> origin/gh/malfet/417/base 2025-12-04T12:46:35.4993883Z * [new branch] gh/malfet/417/head -> origin/gh/malfet/417/head 2025-12-04T12:46:35.4993953Z * [new branch] gh/malfet/417/orig -> origin/gh/malfet/417/orig 2025-12-04T12:46:35.4994021Z * [new branch] gh/malfet/506/base -> origin/gh/malfet/506/base 2025-12-04T12:46:35.4994088Z * [new branch] gh/malfet/506/head -> origin/gh/malfet/506/head 2025-12-04T12:46:35.4994157Z * [new branch] gh/malfet/506/orig -> origin/gh/malfet/506/orig 2025-12-04T12:46:35.4994224Z * [new branch] gh/malfet/517/base -> origin/gh/malfet/517/base 2025-12-04T12:46:35.4994290Z * [new branch] gh/malfet/517/head -> origin/gh/malfet/517/head 2025-12-04T12:46:35.4994359Z * [new branch] gh/malfet/528/base -> origin/gh/malfet/528/base 2025-12-04T12:46:35.4994425Z * [new branch] gh/malfet/528/head -> origin/gh/malfet/528/head 2025-12-04T12:46:35.4994492Z * [new branch] gh/malfet/528/orig -> origin/gh/malfet/528/orig 2025-12-04T12:46:35.4994584Z * [new branch] gh/malfet/537/base -> origin/gh/malfet/537/base 2025-12-04T12:46:35.4994652Z * [new branch] gh/malfet/537/head -> origin/gh/malfet/537/head 2025-12-04T12:46:35.4994718Z * [new branch] gh/malfet/537/orig -> origin/gh/malfet/537/orig 2025-12-04T12:46:35.4994816Z * [new branch] gh/malfet/546/base -> origin/gh/malfet/546/base 2025-12-04T12:46:35.4994882Z * [new branch] gh/malfet/546/head -> origin/gh/malfet/546/head 2025-12-04T12:46:35.4994949Z * [new branch] gh/malfet/546/orig -> origin/gh/malfet/546/orig 2025-12-04T12:46:35.4995018Z * [new branch] gh/malfet/565/base -> origin/gh/malfet/565/base 2025-12-04T12:46:35.4995084Z * [new branch] gh/malfet/565/head -> origin/gh/malfet/565/head 
2025-12-04T12:46:35.4995151Z * [new branch] gh/malfet/565/orig -> origin/gh/malfet/565/orig 2025-12-04T12:46:35.4995220Z * [new branch] gh/malfet/575/base -> origin/gh/malfet/575/base 2025-12-04T12:46:35.4995288Z * [new branch] gh/malfet/575/head -> origin/gh/malfet/575/head 2025-12-04T12:46:35.4995354Z * [new branch] gh/malfet/575/orig -> origin/gh/malfet/575/orig 2025-12-04T12:46:35.4995423Z * [new branch] gh/malfet/580/base -> origin/gh/malfet/580/base 2025-12-04T12:46:35.4995489Z * [new branch] gh/malfet/580/head -> origin/gh/malfet/580/head 2025-12-04T12:46:35.4995557Z * [new branch] gh/malfet/580/orig -> origin/gh/malfet/580/orig 2025-12-04T12:46:35.4995623Z * [new branch] gh/malfet/581/base -> origin/gh/malfet/581/base 2025-12-04T12:46:35.4995690Z * [new branch] gh/malfet/581/head -> origin/gh/malfet/581/head 2025-12-04T12:46:35.4995758Z * [new branch] gh/malfet/581/orig -> origin/gh/malfet/581/orig 2025-12-04T12:46:35.4995826Z * [new branch] gh/malfet/583/base -> origin/gh/malfet/583/base 2025-12-04T12:46:35.4995893Z * [new branch] gh/malfet/583/head -> origin/gh/malfet/583/head 2025-12-04T12:46:35.4995961Z * [new branch] gh/malfet/583/orig -> origin/gh/malfet/583/orig 2025-12-04T12:46:35.4996028Z * [new branch] gh/malfet/586/base -> origin/gh/malfet/586/base 2025-12-04T12:46:35.4996095Z * [new branch] gh/malfet/586/head -> origin/gh/malfet/586/head 2025-12-04T12:46:35.4996164Z * [new branch] gh/malfet/586/orig -> origin/gh/malfet/586/orig 2025-12-04T12:46:35.4996230Z * [new branch] gh/malfet/587/base -> origin/gh/malfet/587/base 2025-12-04T12:46:35.4996297Z * [new branch] gh/malfet/587/head -> origin/gh/malfet/587/head 2025-12-04T12:46:35.4996365Z * [new branch] gh/malfet/587/orig -> origin/gh/malfet/587/orig 2025-12-04T12:46:35.4996432Z * [new branch] gh/malfet/588/base -> origin/gh/malfet/588/base 2025-12-04T12:46:35.4996498Z * [new branch] gh/malfet/588/head -> origin/gh/malfet/588/head 2025-12-04T12:46:35.4996566Z * [new branch] gh/malfet/588/orig -> 
origin/gh/malfet/588/orig 2025-12-04T12:46:35.4996634Z * [new branch] gh/malfet/589/base -> origin/gh/malfet/589/base 2025-12-04T12:46:35.4996700Z * [new branch] gh/malfet/589/head -> origin/gh/malfet/589/head 2025-12-04T12:46:35.4996767Z * [new branch] gh/malfet/589/orig -> origin/gh/malfet/589/orig 2025-12-04T12:46:35.4996833Z * [new branch] gh/malfet/590/base -> origin/gh/malfet/590/base 2025-12-04T12:46:35.4996899Z * [new branch] gh/malfet/590/head -> origin/gh/malfet/590/head 2025-12-04T12:46:35.4996967Z * [new branch] gh/malfet/590/orig -> origin/gh/malfet/590/orig 2025-12-04T12:46:35.4997060Z * [new branch] gh/malfet/591/base -> origin/gh/malfet/591/base 2025-12-04T12:46:35.4997127Z * [new branch] gh/malfet/591/head -> origin/gh/malfet/591/head 2025-12-04T12:46:35.4997195Z * [new branch] gh/malfet/591/orig -> origin/gh/malfet/591/orig 2025-12-04T12:46:35.4997298Z * [new branch] gh/malfet/592/base -> origin/gh/malfet/592/base 2025-12-04T12:46:35.4997365Z * [new branch] gh/malfet/592/head -> origin/gh/malfet/592/head 2025-12-04T12:46:35.4997432Z * [new branch] gh/malfet/592/orig -> origin/gh/malfet/592/orig 2025-12-04T12:46:35.4997546Z * [new branch] gh/malfet/593/base -> origin/gh/malfet/593/base 2025-12-04T12:46:35.4997616Z * [new branch] gh/malfet/593/head -> origin/gh/malfet/593/head 2025-12-04T12:46:35.4997683Z * [new branch] gh/malfet/593/orig -> origin/gh/malfet/593/orig 2025-12-04T12:46:35.4997751Z * [new branch] gh/malfet/594/base -> origin/gh/malfet/594/base 2025-12-04T12:46:35.4997819Z * [new branch] gh/malfet/594/head -> origin/gh/malfet/594/head 2025-12-04T12:46:35.4997885Z * [new branch] gh/malfet/594/orig -> origin/gh/malfet/594/orig 2025-12-04T12:46:35.4997951Z * [new branch] gh/malfet/595/base -> origin/gh/malfet/595/base 2025-12-04T12:46:35.4998021Z * [new branch] gh/malfet/595/head -> origin/gh/malfet/595/head 2025-12-04T12:46:35.4998088Z * [new branch] gh/malfet/595/orig -> origin/gh/malfet/595/orig 2025-12-04T12:46:35.4998154Z * [new 
branch] gh/malfet/596/base -> origin/gh/malfet/596/base 2025-12-04T12:46:35.4998222Z * [new branch] gh/malfet/596/head -> origin/gh/malfet/596/head 2025-12-04T12:46:35.4998289Z * [new branch] gh/malfet/596/orig -> origin/gh/malfet/596/orig 2025-12-04T12:46:35.4998356Z * [new branch] gh/malfet/597/base -> origin/gh/malfet/597/base 2025-12-04T12:46:35.4998424Z * [new branch] gh/malfet/597/head -> origin/gh/malfet/597/head 2025-12-04T12:46:35.4998490Z * [new branch] gh/malfet/597/orig -> origin/gh/malfet/597/orig 2025-12-04T12:46:35.4998556Z * [new branch] gh/malfet/598/base -> origin/gh/malfet/598/base 2025-12-04T12:46:35.4998625Z * [new branch] gh/malfet/598/head -> origin/gh/malfet/598/head 2025-12-04T12:46:35.4998691Z * [new branch] gh/malfet/598/orig -> origin/gh/malfet/598/orig 2025-12-04T12:46:35.4998758Z * [new branch] gh/malfet/599/base -> origin/gh/malfet/599/base 2025-12-04T12:46:35.4998826Z * [new branch] gh/malfet/599/head -> origin/gh/malfet/599/head 2025-12-04T12:46:35.4998892Z * [new branch] gh/malfet/599/orig -> origin/gh/malfet/599/orig 2025-12-04T12:46:35.4998961Z * [new branch] gh/malfet/600/base -> origin/gh/malfet/600/base 2025-12-04T12:46:35.4999027Z * [new branch] gh/malfet/600/head -> origin/gh/malfet/600/head 2025-12-04T12:46:35.4999094Z * [new branch] gh/malfet/600/orig -> origin/gh/malfet/600/orig 2025-12-04T12:46:35.4999162Z * [new branch] gh/malfet/601/base -> origin/gh/malfet/601/base 2025-12-04T12:46:35.4999230Z * [new branch] gh/malfet/601/head -> origin/gh/malfet/601/head 2025-12-04T12:46:35.4999297Z * [new branch] gh/malfet/601/orig -> origin/gh/malfet/601/orig 2025-12-04T12:46:35.4999364Z * [new branch] gh/malfet/602/base -> origin/gh/malfet/602/base 2025-12-04T12:46:35.4999430Z * [new branch] gh/malfet/602/head -> origin/gh/malfet/602/head 2025-12-04T12:46:35.4999496Z * [new branch] gh/malfet/602/orig -> origin/gh/malfet/602/orig 2025-12-04T12:46:35.4999603Z * [new branch] gh/malfet/603/base -> origin/gh/malfet/603/base 
2025-12-04T12:46:35.4999670Z * [new branch] gh/malfet/603/head -> origin/gh/malfet/603/head 2025-12-04T12:46:35.4999737Z * [new branch] gh/malfet/603/orig -> origin/gh/malfet/603/orig 2025-12-04T12:46:35.4999805Z * [new branch] gh/malfet/604/base -> origin/gh/malfet/604/base 2025-12-04T12:46:35.4999910Z * [new branch] gh/malfet/604/head -> origin/gh/malfet/604/head 2025-12-04T12:46:35.4999976Z * [new branch] gh/malfet/604/orig -> origin/gh/malfet/604/orig 2025-12-04T12:46:35.5000044Z * [new branch] gh/malfet/605/base -> origin/gh/malfet/605/base 2025-12-04T12:46:35.5000111Z * [new branch] gh/malfet/605/head -> origin/gh/malfet/605/head 2025-12-04T12:46:35.5000178Z * [new branch] gh/malfet/605/orig -> origin/gh/malfet/605/orig 2025-12-04T12:46:35.5000246Z * [new branch] gh/malfet/606/base -> origin/gh/malfet/606/base 2025-12-04T12:46:35.5000313Z * [new branch] gh/malfet/606/head -> origin/gh/malfet/606/head 2025-12-04T12:46:35.5000380Z * [new branch] gh/malfet/606/orig -> origin/gh/malfet/606/orig 2025-12-04T12:46:35.5000447Z * [new branch] gh/malfet/607/base -> origin/gh/malfet/607/base 2025-12-04T12:46:35.5000515Z * [new branch] gh/malfet/607/head -> origin/gh/malfet/607/head 2025-12-04T12:46:35.5000583Z * [new branch] gh/malfet/607/orig -> origin/gh/malfet/607/orig 2025-12-04T12:46:35.5000649Z * [new branch] gh/malfet/608/base -> origin/gh/malfet/608/base 2025-12-04T12:46:35.5000715Z * [new branch] gh/malfet/608/head -> origin/gh/malfet/608/head 2025-12-04T12:46:35.5000783Z * [new branch] gh/malfet/608/orig -> origin/gh/malfet/608/orig 2025-12-04T12:46:35.5000850Z * [new branch] gh/malfet/609/base -> origin/gh/malfet/609/base 2025-12-04T12:46:35.5000917Z * [new branch] gh/malfet/609/head -> origin/gh/malfet/609/head 2025-12-04T12:46:35.5000984Z * [new branch] gh/malfet/609/orig -> origin/gh/malfet/609/orig 2025-12-04T12:46:35.5001051Z * [new branch] gh/malfet/610/base -> origin/gh/malfet/610/base 2025-12-04T12:46:35.5001119Z * [new branch] gh/malfet/610/head -> 
origin/gh/malfet/610/head 2025-12-04T12:46:35.5001187Z * [new branch] gh/malfet/610/orig -> origin/gh/malfet/610/orig 2025-12-04T12:46:35.5001254Z * [new branch] gh/malfet/611/base -> origin/gh/malfet/611/base 2025-12-04T12:46:35.5001321Z * [new branch] gh/malfet/611/head -> origin/gh/malfet/611/head 2025-12-04T12:46:35.5001388Z * [new branch] gh/malfet/611/orig -> origin/gh/malfet/611/orig 2025-12-04T12:46:35.5001456Z * [new branch] gh/malfet/612/base -> origin/gh/malfet/612/base 2025-12-04T12:46:35.5001522Z * [new branch] gh/malfet/612/head -> origin/gh/malfet/612/head 2025-12-04T12:46:35.5001590Z * [new branch] gh/malfet/612/orig -> origin/gh/malfet/612/orig 2025-12-04T12:46:35.5001658Z * [new branch] gh/malfet/64/base -> origin/gh/malfet/64/base 2025-12-04T12:46:35.5001726Z * [new branch] gh/malfet/64/head -> origin/gh/malfet/64/head 2025-12-04T12:46:35.5001817Z * [new branch] gh/manuelcandales/11/base -> origin/gh/manuelcandales/11/base 2025-12-04T12:46:35.5001903Z * [new branch] gh/manuelcandales/11/head -> origin/gh/manuelcandales/11/head 2025-12-04T12:46:35.5001985Z * [new branch] gh/manuelcandales/11/orig -> origin/gh/manuelcandales/11/orig 2025-12-04T12:46:35.5002055Z * [new branch] gh/markkm/1/base -> origin/gh/markkm/1/base 2025-12-04T12:46:35.5002153Z * [new branch] gh/masnesral/1/base -> origin/gh/masnesral/1/base 2025-12-04T12:46:35.5002225Z * [new branch] gh/masnesral/1/head -> origin/gh/masnesral/1/head 2025-12-04T12:46:35.5002296Z * [new branch] gh/masnesral/1/orig -> origin/gh/masnesral/1/orig 2025-12-04T12:46:35.5002367Z * [new branch] gh/mhorowitz/0/base -> origin/gh/mhorowitz/0/base 2025-12-04T12:46:35.5002462Z * [new branch] gh/mhorowitz/0/head -> origin/gh/mhorowitz/0/head 2025-12-04T12:46:35.5002531Z * [new branch] gh/mhorowitz/1/base -> origin/gh/mhorowitz/1/base 2025-12-04T12:46:35.5002601Z * [new branch] gh/mhorowitz/1/head -> origin/gh/mhorowitz/1/head 2025-12-04T12:46:35.5002671Z * [new branch] gh/mhorowitz/2/base -> 
origin/gh/mhorowitz/2/base 2025-12-04T12:46:35.5002739Z * [new branch] gh/mhorowitz/2/head -> origin/gh/mhorowitz/2/head 2025-12-04T12:46:35.5002809Z * [new branch] gh/mhorowitz/3/base -> origin/gh/mhorowitz/3/base 2025-12-04T12:46:35.5002879Z * [new branch] gh/mhorowitz/3/head -> origin/gh/mhorowitz/3/head 2025-12-04T12:46:35.5002949Z * [new branch] gh/mhorowitz/4/base -> origin/gh/mhorowitz/4/base 2025-12-04T12:46:35.5003018Z * [new branch] gh/mhorowitz/4/head -> origin/gh/mhorowitz/4/head 2025-12-04T12:46:35.5003092Z * [new branch] gh/mhorowitz/5/base -> origin/gh/mhorowitz/5/base 2025-12-04T12:46:35.5003161Z * [new branch] gh/mhorowitz/5/head -> origin/gh/mhorowitz/5/head 2025-12-04T12:46:35.5003230Z * [new branch] gh/mhorowitz/6/base -> origin/gh/mhorowitz/6/base 2025-12-04T12:46:35.5003301Z * [new branch] gh/mhorowitz/6/head -> origin/gh/mhorowitz/6/head 2025-12-04T12:46:35.5003400Z * [new branch] gh/mikaylagawarecki/234/base -> origin/gh/mikaylagawarecki/234/base 2025-12-04T12:46:35.5003497Z * [new branch] gh/mikaylagawarecki/234/head -> origin/gh/mikaylagawarecki/234/head 2025-12-04T12:46:35.5003589Z * [new branch] gh/mikaylagawarecki/235/base -> origin/gh/mikaylagawarecki/235/base 2025-12-04T12:46:35.5003681Z * [new branch] gh/mikaylagawarecki/235/head -> origin/gh/mikaylagawarecki/235/head 2025-12-04T12:46:35.5003773Z * [new branch] gh/mikaylagawarecki/236/base -> origin/gh/mikaylagawarecki/236/base 2025-12-04T12:46:35.5003865Z * [new branch] gh/mikaylagawarecki/236/head -> origin/gh/mikaylagawarecki/236/head 2025-12-04T12:46:35.5003955Z * [new branch] gh/mikaylagawarecki/237/base -> origin/gh/mikaylagawarecki/237/base 2025-12-04T12:46:35.5004048Z * [new branch] gh/mikaylagawarecki/237/head -> origin/gh/mikaylagawarecki/237/head 2025-12-04T12:46:35.5004139Z * [new branch] gh/mikaylagawarecki/238/base -> origin/gh/mikaylagawarecki/238/base 2025-12-04T12:46:35.5004231Z * [new branch] gh/mikaylagawarecki/238/head -> origin/gh/mikaylagawarecki/238/head 
2025-12-04T12:46:35.5004324Z * [new branch] gh/mikaylagawarecki/336/base -> origin/gh/mikaylagawarecki/336/base 2025-12-04T12:46:35.5004415Z * [new branch] gh/mikaylagawarecki/336/head -> origin/gh/mikaylagawarecki/336/head 2025-12-04T12:46:35.5004507Z * [new branch] gh/mikaylagawarecki/336/orig -> origin/gh/mikaylagawarecki/336/orig 2025-12-04T12:46:35.5004600Z * [new branch] gh/mikaylagawarecki/341/base -> origin/gh/mikaylagawarecki/341/base 2025-12-04T12:46:35.5004691Z * [new branch] gh/mikaylagawarecki/341/head -> origin/gh/mikaylagawarecki/341/head 2025-12-04T12:46:35.5004783Z * [new branch] gh/mikaylagawarecki/341/orig -> origin/gh/mikaylagawarecki/341/orig 2025-12-04T12:46:35.5004875Z * [new branch] gh/mikaylagawarecki/342/base -> origin/gh/mikaylagawarecki/342/base 2025-12-04T12:46:35.5004991Z * [new branch] gh/mikaylagawarecki/342/head -> origin/gh/mikaylagawarecki/342/head 2025-12-04T12:46:35.5005083Z * [new branch] gh/mikaylagawarecki/342/orig -> origin/gh/mikaylagawarecki/342/orig 2025-12-04T12:46:35.5005175Z * [new branch] gh/mikaylagawarecki/345/base -> origin/gh/mikaylagawarecki/345/base 2025-12-04T12:46:35.5005297Z * [new branch] gh/mikaylagawarecki/345/head -> origin/gh/mikaylagawarecki/345/head 2025-12-04T12:46:35.5005391Z * [new branch] gh/mikaylagawarecki/345/orig -> origin/gh/mikaylagawarecki/345/orig 2025-12-04T12:46:35.5005481Z * [new branch] gh/mikaylagawarecki/346/base -> origin/gh/mikaylagawarecki/346/base 2025-12-04T12:46:35.5005571Z * [new branch] gh/mikaylagawarecki/346/head -> origin/gh/mikaylagawarecki/346/head 2025-12-04T12:46:35.5005663Z * [new branch] gh/mikaylagawarecki/346/orig -> origin/gh/mikaylagawarecki/346/orig 2025-12-04T12:46:35.5005754Z * [new branch] gh/mikaylagawarecki/347/base -> origin/gh/mikaylagawarecki/347/base 2025-12-04T12:46:35.5005844Z * [new branch] gh/mikaylagawarecki/347/head -> origin/gh/mikaylagawarecki/347/head 2025-12-04T12:46:35.5005936Z * [new branch] gh/mikaylagawarecki/347/orig -> 
origin/gh/mikaylagawarecki/347/orig 2025-12-04T12:46:35.5006027Z * [new branch] gh/mikaylagawarecki/350/base -> origin/gh/mikaylagawarecki/350/base 2025-12-04T12:46:35.5006118Z * [new branch] gh/mikaylagawarecki/350/head -> origin/gh/mikaylagawarecki/350/head 2025-12-04T12:46:35.5006210Z * [new branch] gh/mikaylagawarecki/350/orig -> origin/gh/mikaylagawarecki/350/orig 2025-12-04T12:46:35.5006301Z * [new branch] gh/mikaylagawarecki/351/base -> origin/gh/mikaylagawarecki/351/base 2025-12-04T12:46:35.5006393Z * [new branch] gh/mikaylagawarecki/351/head -> origin/gh/mikaylagawarecki/351/head 2025-12-04T12:46:35.5006487Z * [new branch] gh/mikaylagawarecki/351/orig -> origin/gh/mikaylagawarecki/351/orig 2025-12-04T12:46:35.5006577Z * [new branch] gh/mikaylagawarecki/352/base -> origin/gh/mikaylagawarecki/352/base 2025-12-04T12:46:35.5006669Z * [new branch] gh/mikaylagawarecki/352/head -> origin/gh/mikaylagawarecki/352/head 2025-12-04T12:46:35.5006760Z * [new branch] gh/mikaylagawarecki/352/orig -> origin/gh/mikaylagawarecki/352/orig 2025-12-04T12:46:35.5006852Z * [new branch] gh/mikaylagawarecki/353/base -> origin/gh/mikaylagawarecki/353/base 2025-12-04T12:46:35.5006944Z * [new branch] gh/mikaylagawarecki/353/head -> origin/gh/mikaylagawarecki/353/head 2025-12-04T12:46:35.5007035Z * [new branch] gh/mikaylagawarecki/353/orig -> origin/gh/mikaylagawarecki/353/orig 2025-12-04T12:46:35.5007126Z * [new branch] gh/mikaylagawarecki/354/base -> origin/gh/mikaylagawarecki/354/base 2025-12-04T12:46:35.5007218Z * [new branch] gh/mikaylagawarecki/354/head -> origin/gh/mikaylagawarecki/354/head 2025-12-04T12:46:35.5007309Z * [new branch] gh/mikaylagawarecki/354/orig -> origin/gh/mikaylagawarecki/354/orig 2025-12-04T12:46:35.5007400Z * [new branch] gh/mikaylagawarecki/356/base -> origin/gh/mikaylagawarecki/356/base 2025-12-04T12:46:35.5007538Z * [new branch] gh/mikaylagawarecki/356/head -> origin/gh/mikaylagawarecki/356/head 2025-12-04T12:46:35.5007631Z * [new branch] 
gh/mikaylagawarecki/356/orig -> origin/gh/mikaylagawarecki/356/orig 2025-12-04T12:46:35.5007722Z * [new branch] gh/mikaylagawarecki/357/base -> origin/gh/mikaylagawarecki/357/base 2025-12-04T12:46:35.5007814Z * [new branch] gh/mikaylagawarecki/357/head -> origin/gh/mikaylagawarecki/357/head 2025-12-04T12:46:35.5007904Z * [new branch] gh/mikaylagawarecki/357/orig -> origin/gh/mikaylagawarecki/357/orig 2025-12-04T12:46:35.5008034Z * [new branch] gh/mikaylagawarecki/359/base -> origin/gh/mikaylagawarecki/359/base 2025-12-04T12:46:35.5008127Z * [new branch] gh/mikaylagawarecki/359/head -> origin/gh/mikaylagawarecki/359/head 2025-12-04T12:46:35.5008218Z * [new branch] gh/mikaylagawarecki/359/orig -> origin/gh/mikaylagawarecki/359/orig 2025-12-04T12:46:35.5008974Z * [new branch] gh/mikaylagawarecki/360/base -> origin/gh/mikaylagawarecki/360/base 2025-12-04T12:46:35.5009065Z * [new branch] gh/mikaylagawarecki/360/head -> origin/gh/mikaylagawarecki/360/head 2025-12-04T12:46:35.5009157Z * [new branch] gh/mikaylagawarecki/360/orig -> origin/gh/mikaylagawarecki/360/orig 2025-12-04T12:46:35.5009249Z * [new branch] gh/mikaylagawarecki/361/base -> origin/gh/mikaylagawarecki/361/base 2025-12-04T12:46:35.5009340Z * [new branch] gh/mikaylagawarecki/361/head -> origin/gh/mikaylagawarecki/361/head 2025-12-04T12:46:35.5009432Z * [new branch] gh/mikaylagawarecki/361/orig -> origin/gh/mikaylagawarecki/361/orig 2025-12-04T12:46:35.5009524Z * [new branch] gh/mikaylagawarecki/362/base -> origin/gh/mikaylagawarecki/362/base 2025-12-04T12:46:35.5009615Z * [new branch] gh/mikaylagawarecki/362/head -> origin/gh/mikaylagawarecki/362/head 2025-12-04T12:46:35.5009707Z * [new branch] gh/mikaylagawarecki/362/orig -> origin/gh/mikaylagawarecki/362/orig 2025-12-04T12:46:35.5009801Z * [new branch] gh/mikaylagawarecki/363/base -> origin/gh/mikaylagawarecki/363/base 2025-12-04T12:46:35.5009892Z * [new branch] gh/mikaylagawarecki/363/head -> origin/gh/mikaylagawarecki/363/head 
2025-12-04T12:46:35.5009983Z * [new branch] gh/mikaylagawarecki/363/orig -> origin/gh/mikaylagawarecki/363/orig 2025-12-04T12:46:35.5010074Z * [new branch] gh/mikaylagawarecki/364/base -> origin/gh/mikaylagawarecki/364/base 2025-12-04T12:46:35.5010166Z * [new branch] gh/mikaylagawarecki/364/head -> origin/gh/mikaylagawarecki/364/head 2025-12-04T12:46:35.5010258Z * [new branch] gh/mikaylagawarecki/364/orig -> origin/gh/mikaylagawarecki/364/orig 2025-12-04T12:46:35.5010349Z * [new branch] gh/mikaylagawarecki/365/base -> origin/gh/mikaylagawarecki/365/base 2025-12-04T12:46:35.5010441Z * [new branch] gh/mikaylagawarecki/365/head -> origin/gh/mikaylagawarecki/365/head 2025-12-04T12:46:35.5010533Z * [new branch] gh/mikaylagawarecki/365/orig -> origin/gh/mikaylagawarecki/365/orig 2025-12-04T12:46:35.5010623Z * [new branch] gh/mikaylagawarecki/366/base -> origin/gh/mikaylagawarecki/366/base 2025-12-04T12:46:35.5010714Z * [new branch] gh/mikaylagawarecki/366/head -> origin/gh/mikaylagawarecki/366/head 2025-12-04T12:46:35.5010805Z * [new branch] gh/mikaylagawarecki/366/orig -> origin/gh/mikaylagawarecki/366/orig 2025-12-04T12:46:35.5010898Z * [new branch] gh/mikaylagawarecki/367/base -> origin/gh/mikaylagawarecki/367/base 2025-12-04T12:46:35.5010988Z * [new branch] gh/mikaylagawarecki/367/head -> origin/gh/mikaylagawarecki/367/head 2025-12-04T12:46:35.5011081Z * [new branch] gh/mikaylagawarecki/367/orig -> origin/gh/mikaylagawarecki/367/orig 2025-12-04T12:46:35.5011173Z * [new branch] gh/mikaylagawarecki/368/base -> origin/gh/mikaylagawarecki/368/base 2025-12-04T12:46:35.5011264Z * [new branch] gh/mikaylagawarecki/368/head -> origin/gh/mikaylagawarecki/368/head 2025-12-04T12:46:35.5011357Z * [new branch] gh/mikaylagawarecki/368/orig -> origin/gh/mikaylagawarecki/368/orig 2025-12-04T12:46:35.5011447Z * [new branch] gh/mikaylagawarecki/369/base -> origin/gh/mikaylagawarecki/369/base 2025-12-04T12:46:35.5011538Z * [new branch] gh/mikaylagawarecki/369/head -> 
origin/gh/mikaylagawarecki/369/head 2025-12-04T12:46:35.5011656Z * [new branch] gh/mikaylagawarecki/369/orig -> origin/gh/mikaylagawarecki/369/orig 2025-12-04T12:46:35.5011748Z * [new branch] gh/mikaylagawarecki/370/base -> origin/gh/mikaylagawarecki/370/base 2025-12-04T12:46:35.5011841Z * [new branch] gh/mikaylagawarecki/370/head -> origin/gh/mikaylagawarecki/370/head 2025-12-04T12:46:35.5011953Z * [new branch] gh/mikaylagawarecki/370/orig -> origin/gh/mikaylagawarecki/370/orig 2025-12-04T12:46:35.5012044Z * [new branch] gh/mikaylagawarecki/371/base -> origin/gh/mikaylagawarecki/371/base 2025-12-04T12:46:35.5012135Z * [new branch] gh/mikaylagawarecki/371/head -> origin/gh/mikaylagawarecki/371/head 2025-12-04T12:46:35.5012225Z * [new branch] gh/mikaylagawarecki/371/orig -> origin/gh/mikaylagawarecki/371/orig 2025-12-04T12:46:35.5012316Z * [new branch] gh/mikaylagawarecki/372/base -> origin/gh/mikaylagawarecki/372/base 2025-12-04T12:46:35.5012408Z * [new branch] gh/mikaylagawarecki/372/head -> origin/gh/mikaylagawarecki/372/head 2025-12-04T12:46:35.5012500Z * [new branch] gh/mikaylagawarecki/372/orig -> origin/gh/mikaylagawarecki/372/orig 2025-12-04T12:46:35.5012591Z * [new branch] gh/mikaylagawarecki/373/base -> origin/gh/mikaylagawarecki/373/base 2025-12-04T12:46:35.5012685Z * [new branch] gh/mikaylagawarecki/373/head -> origin/gh/mikaylagawarecki/373/head 2025-12-04T12:46:35.5012777Z * [new branch] gh/mikaylagawarecki/373/orig -> origin/gh/mikaylagawarecki/373/orig 2025-12-04T12:46:35.5012867Z * [new branch] gh/mikaylagawarecki/374/base -> origin/gh/mikaylagawarecki/374/base 2025-12-04T12:46:35.5012959Z * [new branch] gh/mikaylagawarecki/374/head -> origin/gh/mikaylagawarecki/374/head 2025-12-04T12:46:35.5013050Z * [new branch] gh/mikaylagawarecki/374/orig -> origin/gh/mikaylagawarecki/374/orig 2025-12-04T12:46:35.5013143Z * [new branch] gh/mikaylagawarecki/375/base -> origin/gh/mikaylagawarecki/375/base 2025-12-04T12:46:35.5013234Z * [new branch] 
gh/mikaylagawarecki/375/head -> origin/gh/mikaylagawarecki/375/head 2025-12-04T12:46:35.5013325Z * [new branch] gh/mikaylagawarecki/375/orig -> origin/gh/mikaylagawarecki/375/orig 2025-12-04T12:46:35.5013418Z * [new branch] gh/mikaylagawarecki/376/base -> origin/gh/mikaylagawarecki/376/base 2025-12-04T12:46:35.5013510Z * [new branch] gh/mikaylagawarecki/376/head -> origin/gh/mikaylagawarecki/376/head 2025-12-04T12:46:35.5013600Z * [new branch] gh/mikaylagawarecki/376/orig -> origin/gh/mikaylagawarecki/376/orig 2025-12-04T12:46:35.5013693Z * [new branch] gh/mikaylagawarecki/377/base -> origin/gh/mikaylagawarecki/377/base 2025-12-04T12:46:35.5013784Z * [new branch] gh/mikaylagawarecki/377/head -> origin/gh/mikaylagawarecki/377/head 2025-12-04T12:46:35.5013876Z * [new branch] gh/mikaylagawarecki/377/orig -> origin/gh/mikaylagawarecki/377/orig 2025-12-04T12:46:35.5013967Z * [new branch] gh/mikaylagawarecki/378/base -> origin/gh/mikaylagawarecki/378/base 2025-12-04T12:46:35.5014058Z * [new branch] gh/mikaylagawarecki/378/head -> origin/gh/mikaylagawarecki/378/head 2025-12-04T12:46:35.5014150Z * [new branch] gh/mikaylagawarecki/378/orig -> origin/gh/mikaylagawarecki/378/orig 2025-12-04T12:46:35.5014242Z * [new branch] gh/mikaylagawarecki/379/base -> origin/gh/mikaylagawarecki/379/base 2025-12-04T12:46:35.5014333Z * [new branch] gh/mikaylagawarecki/379/head -> origin/gh/mikaylagawarecki/379/head 2025-12-04T12:46:35.5014424Z * [new branch] gh/mikaylagawarecki/379/orig -> origin/gh/mikaylagawarecki/379/orig 2025-12-04T12:46:35.5014515Z * [new branch] gh/mikaylagawarecki/380/base -> origin/gh/mikaylagawarecki/380/base 2025-12-04T12:46:35.5014632Z * [new branch] gh/mikaylagawarecki/380/head -> origin/gh/mikaylagawarecki/380/head 2025-12-04T12:46:35.5014724Z * [new branch] gh/mikaylagawarecki/380/orig -> origin/gh/mikaylagawarecki/380/orig 2025-12-04T12:46:35.5014815Z * [new branch] gh/mikaylagawarecki/381/base -> origin/gh/mikaylagawarecki/381/base 
2025-12-04T12:46:35.5014927Z * [new branch] gh/mikaylagawarecki/381/head -> origin/gh/mikaylagawarecki/381/head 2025-12-04T12:46:35.5015019Z * [new branch] gh/mikaylagawarecki/381/orig -> origin/gh/mikaylagawarecki/381/orig 2025-12-04T12:46:35.5015110Z * [new branch] gh/mikaylagawarecki/382/base -> origin/gh/mikaylagawarecki/382/base 2025-12-04T12:46:35.5015200Z * [new branch] gh/mikaylagawarecki/382/head -> origin/gh/mikaylagawarecki/382/head 2025-12-04T12:46:35.5015292Z * [new branch] gh/mikaylagawarecki/382/orig -> origin/gh/mikaylagawarecki/382/orig 2025-12-04T12:46:35.5015382Z * [new branch] gh/mikaylagawarecki/383/base -> origin/gh/mikaylagawarecki/383/base 2025-12-04T12:46:35.5015473Z * [new branch] gh/mikaylagawarecki/383/head -> origin/gh/mikaylagawarecki/383/head 2025-12-04T12:46:35.5015564Z * [new branch] gh/mikaylagawarecki/383/orig -> origin/gh/mikaylagawarecki/383/orig 2025-12-04T12:46:35.5015656Z * [new branch] gh/mikaylagawarecki/384/base -> origin/gh/mikaylagawarecki/384/base 2025-12-04T12:46:35.5015746Z * [new branch] gh/mikaylagawarecki/384/head -> origin/gh/mikaylagawarecki/384/head 2025-12-04T12:46:35.5015837Z * [new branch] gh/mikaylagawarecki/384/orig -> origin/gh/mikaylagawarecki/384/orig 2025-12-04T12:46:35.5015927Z * [new branch] gh/mikaylagawarecki/385/base -> origin/gh/mikaylagawarecki/385/base 2025-12-04T12:46:35.5016018Z * [new branch] gh/mikaylagawarecki/385/head -> origin/gh/mikaylagawarecki/385/head 2025-12-04T12:46:35.5016110Z * [new branch] gh/mikaylagawarecki/385/orig -> origin/gh/mikaylagawarecki/385/orig 2025-12-04T12:46:35.5016200Z * [new branch] gh/mikaylagawarecki/386/base -> origin/gh/mikaylagawarecki/386/base 2025-12-04T12:46:35.5016292Z * [new branch] gh/mikaylagawarecki/386/head -> origin/gh/mikaylagawarecki/386/head 2025-12-04T12:46:35.5016384Z * [new branch] gh/mikaylagawarecki/386/orig -> origin/gh/mikaylagawarecki/386/orig 2025-12-04T12:46:35.5016474Z * [new branch] gh/mikaylagawarecki/387/base -> 
origin/gh/mikaylagawarecki/387/base 2025-12-04T12:46:35.5016567Z * [new branch] gh/mikaylagawarecki/387/head -> origin/gh/mikaylagawarecki/387/head 2025-12-04T12:46:35.5016660Z * [new branch] gh/mikaylagawarecki/387/orig -> origin/gh/mikaylagawarecki/387/orig 2025-12-04T12:46:35.5016750Z * [new branch] gh/mikaylagawarecki/388/base -> origin/gh/mikaylagawarecki/388/base 2025-12-04T12:46:35.5016843Z * [new branch] gh/mikaylagawarecki/388/head -> origin/gh/mikaylagawarecki/388/head 2025-12-04T12:46:35.5016933Z * [new branch] gh/mikaylagawarecki/388/orig -> origin/gh/mikaylagawarecki/388/orig 2025-12-04T12:46:35.5017024Z * [new branch] gh/mikaylagawarecki/389/base -> origin/gh/mikaylagawarecki/389/base 2025-12-04T12:46:35.5017117Z * [new branch] gh/mikaylagawarecki/389/head -> origin/gh/mikaylagawarecki/389/head 2025-12-04T12:46:35.5017207Z * [new branch] gh/mikaylagawarecki/389/orig -> origin/gh/mikaylagawarecki/389/orig 2025-12-04T12:46:35.5017298Z * [new branch] gh/mikaylagawarecki/390/base -> origin/gh/mikaylagawarecki/390/base 2025-12-04T12:46:35.5017389Z * [new branch] gh/mikaylagawarecki/390/head -> origin/gh/mikaylagawarecki/390/head 2025-12-04T12:46:35.5017533Z * [new branch] gh/mikaylagawarecki/390/orig -> origin/gh/mikaylagawarecki/390/orig 2025-12-04T12:46:35.5017671Z * [new branch] gh/mikaylagawarecki/391/base -> origin/gh/mikaylagawarecki/391/base 2025-12-04T12:46:35.5017762Z * [new branch] gh/mikaylagawarecki/391/head -> origin/gh/mikaylagawarecki/391/head 2025-12-04T12:46:35.5017855Z * [new branch] gh/mikaylagawarecki/391/orig -> origin/gh/mikaylagawarecki/391/orig 2025-12-04T12:46:35.5017988Z * [new branch] gh/mikaylagawarecki/392/base -> origin/gh/mikaylagawarecki/392/base 2025-12-04T12:46:35.5018079Z * [new branch] gh/mikaylagawarecki/392/head -> origin/gh/mikaylagawarecki/392/head 2025-12-04T12:46:35.5018169Z * [new branch] gh/mikaylagawarecki/392/orig -> origin/gh/mikaylagawarecki/392/orig 2025-12-04T12:46:35.5018260Z * [new branch] 
gh/mikaylagawarecki/393/base -> origin/gh/mikaylagawarecki/393/base 2025-12-04T12:46:35.5018351Z * [new branch] gh/mikaylagawarecki/393/head -> origin/gh/mikaylagawarecki/393/head 2025-12-04T12:46:35.5018444Z * [new branch] gh/mikaylagawarecki/393/orig -> origin/gh/mikaylagawarecki/393/orig 2025-12-04T12:46:35.5018515Z * [new branch] gh/mlazos/41/base -> origin/gh/mlazos/41/base 2025-12-04T12:46:35.5018583Z * [new branch] gh/mlazos/41/head -> origin/gh/mlazos/41/head 2025-12-04T12:46:35.5018651Z * [new branch] gh/mlazos/41/orig -> origin/gh/mlazos/41/orig 2025-12-04T12:46:35.5018721Z * [new branch] gh/mlazos/42/base -> origin/gh/mlazos/42/base 2025-12-04T12:46:35.5018787Z * [new branch] gh/mlazos/42/head -> origin/gh/mlazos/42/head 2025-12-04T12:46:35.5018853Z * [new branch] gh/mlazos/42/orig -> origin/gh/mlazos/42/orig 2025-12-04T12:46:35.5018920Z * [new branch] gh/mlazos/43/base -> origin/gh/mlazos/43/base 2025-12-04T12:46:35.5018986Z * [new branch] gh/mlazos/43/head -> origin/gh/mlazos/43/head 2025-12-04T12:46:35.5019054Z * [new branch] gh/mlazos/43/orig -> origin/gh/mlazos/43/orig 2025-12-04T12:46:35.5019120Z * [new branch] gh/mlazos/44/base -> origin/gh/mlazos/44/base 2025-12-04T12:46:35.5019186Z * [new branch] gh/mlazos/44/head -> origin/gh/mlazos/44/head 2025-12-04T12:46:35.5019253Z * [new branch] gh/mlazos/44/orig -> origin/gh/mlazos/44/orig 2025-12-04T12:46:35.5019320Z * [new branch] gh/mlazos/47/base -> origin/gh/mlazos/47/base 2025-12-04T12:46:35.5019386Z * [new branch] gh/mlazos/47/head -> origin/gh/mlazos/47/head 2025-12-04T12:46:35.5019453Z * [new branch] gh/mlazos/47/orig -> origin/gh/mlazos/47/orig 2025-12-04T12:46:35.5019519Z * [new branch] gh/mlazos/48/base -> origin/gh/mlazos/48/base 2025-12-04T12:46:35.5019585Z * [new branch] gh/mlazos/48/head -> origin/gh/mlazos/48/head 2025-12-04T12:46:35.5019654Z * [new branch] gh/mlazos/48/orig -> origin/gh/mlazos/48/orig 2025-12-04T12:46:35.5019721Z * [new branch] gh/mlazos/49/base -> 
origin/gh/mlazos/49/base 2025-12-04T12:46:35.5019787Z * [new branch] gh/mlazos/49/head -> origin/gh/mlazos/49/head 2025-12-04T12:46:35.5019854Z * [new branch] gh/mlazos/49/orig -> origin/gh/mlazos/49/orig 2025-12-04T12:46:35.5019922Z * [new branch] gh/mlazos/50/base -> origin/gh/mlazos/50/base 2025-12-04T12:46:35.5019988Z * [new branch] gh/mlazos/50/head -> origin/gh/mlazos/50/head 2025-12-04T12:46:35.5020056Z * [new branch] gh/mlazos/50/orig -> origin/gh/mlazos/50/orig 2025-12-04T12:46:35.5020122Z * [new branch] gh/mlazos/51/base -> origin/gh/mlazos/51/base 2025-12-04T12:46:35.5020188Z * [new branch] gh/mlazos/51/head -> origin/gh/mlazos/51/head 2025-12-04T12:46:35.5020292Z * [new branch] gh/mlazos/51/orig -> origin/gh/mlazos/51/orig 2025-12-04T12:46:35.5020358Z * [new branch] gh/mlazos/52/base -> origin/gh/mlazos/52/base 2025-12-04T12:46:35.5020424Z * [new branch] gh/mlazos/52/head -> origin/gh/mlazos/52/head 2025-12-04T12:46:35.5020491Z * [new branch] gh/mlazos/52/orig -> origin/gh/mlazos/52/orig 2025-12-04T12:46:35.5020579Z * [new branch] gh/mlazos/53/base -> origin/gh/mlazos/53/base 2025-12-04T12:46:35.5020647Z * [new branch] gh/mlazos/53/head -> origin/gh/mlazos/53/head 2025-12-04T12:46:35.5020713Z * [new branch] gh/mlazos/53/orig -> origin/gh/mlazos/53/orig 2025-12-04T12:46:35.5020778Z * [new branch] gh/mlazos/54/base -> origin/gh/mlazos/54/base 2025-12-04T12:46:35.5020845Z * [new branch] gh/mlazos/54/head -> origin/gh/mlazos/54/head 2025-12-04T12:46:35.5020912Z * [new branch] gh/mlazos/54/orig -> origin/gh/mlazos/54/orig 2025-12-04T12:46:35.5020977Z * [new branch] gh/mlazos/55/base -> origin/gh/mlazos/55/base 2025-12-04T12:46:35.5021046Z * [new branch] gh/mlazos/55/head -> origin/gh/mlazos/55/head 2025-12-04T12:46:35.5021111Z * [new branch] gh/mlazos/55/orig -> origin/gh/mlazos/55/orig 2025-12-04T12:46:35.5021178Z * [new branch] gh/mlazos/56/base -> origin/gh/mlazos/56/base 2025-12-04T12:46:35.5021246Z * [new branch] gh/mlazos/56/head -> 
origin/gh/mlazos/56/head 2025-12-04T12:46:35.5021312Z * [new branch] gh/mlazos/56/orig -> origin/gh/mlazos/56/orig 2025-12-04T12:46:35.5021378Z * [new branch] gh/mlazos/57/base -> origin/gh/mlazos/57/base 2025-12-04T12:46:35.5021445Z * [new branch] gh/mlazos/57/head -> origin/gh/mlazos/57/head 2025-12-04T12:46:35.5021510Z * [new branch] gh/mlazos/57/orig -> origin/gh/mlazos/57/orig 2025-12-04T12:46:35.5021577Z * [new branch] gh/mlazos/58/base -> origin/gh/mlazos/58/base 2025-12-04T12:46:35.5021645Z * [new branch] gh/mlazos/58/head -> origin/gh/mlazos/58/head 2025-12-04T12:46:35.5021711Z * [new branch] gh/mlazos/58/orig -> origin/gh/mlazos/58/orig 2025-12-04T12:46:35.5021778Z * [new branch] gh/mlazos/59/base -> origin/gh/mlazos/59/base 2025-12-04T12:46:35.5021845Z * [new branch] gh/mlazos/59/head -> origin/gh/mlazos/59/head 2025-12-04T12:46:35.5021912Z * [new branch] gh/mlazos/59/orig -> origin/gh/mlazos/59/orig 2025-12-04T12:46:35.5021977Z * [new branch] gh/mlazos/60/base -> origin/gh/mlazos/60/base 2025-12-04T12:46:35.5022044Z * [new branch] gh/mlazos/60/head -> origin/gh/mlazos/60/head 2025-12-04T12:46:35.5022110Z * [new branch] gh/mlazos/60/orig -> origin/gh/mlazos/60/orig 2025-12-04T12:46:35.5022177Z * [new branch] gh/mlazos/61/base -> origin/gh/mlazos/61/base 2025-12-04T12:46:35.5022244Z * [new branch] gh/mlazos/61/head -> origin/gh/mlazos/61/head 2025-12-04T12:46:35.5022310Z * [new branch] gh/mlazos/61/orig -> origin/gh/mlazos/61/orig 2025-12-04T12:46:35.5022378Z * [new branch] gh/mlazos/62/base -> origin/gh/mlazos/62/base 2025-12-04T12:46:35.5022445Z * [new branch] gh/mlazos/62/head -> origin/gh/mlazos/62/head 2025-12-04T12:46:35.5022510Z * [new branch] gh/mlazos/62/orig -> origin/gh/mlazos/62/orig 2025-12-04T12:46:35.5022577Z * [new branch] gh/mlazos/63/base -> origin/gh/mlazos/63/base 2025-12-04T12:46:35.5022643Z * [new branch] gh/mlazos/63/head -> origin/gh/mlazos/63/head 2025-12-04T12:46:35.5022709Z * [new branch] gh/mlazos/63/orig -> 
origin/gh/mlazos/63/orig 2025-12-04T12:46:35.5022800Z * [new branch] gh/mlazos/64/base -> origin/gh/mlazos/64/base 2025-12-04T12:46:35.5022866Z * [new branch] gh/mlazos/64/head -> origin/gh/mlazos/64/head 2025-12-04T12:46:35.5022932Z * [new branch] gh/mlazos/64/orig -> origin/gh/mlazos/64/orig 2025-12-04T12:46:35.5022999Z * [new branch] gh/mlazos/65/base -> origin/gh/mlazos/65/base 2025-12-04T12:46:35.5023088Z * [new branch] gh/mlazos/65/head -> origin/gh/mlazos/65/head 2025-12-04T12:46:35.5023154Z * [new branch] gh/mlazos/65/orig -> origin/gh/mlazos/65/orig 2025-12-04T12:46:35.5023221Z * [new branch] gh/mlazos/66/base -> origin/gh/mlazos/66/base 2025-12-04T12:46:35.5023286Z * [new branch] gh/mlazos/66/head -> origin/gh/mlazos/66/head 2025-12-04T12:46:35.5023352Z * [new branch] gh/mlazos/66/orig -> origin/gh/mlazos/66/orig 2025-12-04T12:46:35.5023420Z * [new branch] gh/mlazos/67/base -> origin/gh/mlazos/67/base 2025-12-04T12:46:35.5023485Z * [new branch] gh/mlazos/67/head -> origin/gh/mlazos/67/head 2025-12-04T12:46:35.5023552Z * [new branch] gh/mlazos/67/orig -> origin/gh/mlazos/67/orig 2025-12-04T12:46:35.5023619Z * [new branch] gh/mlazos/68/base -> origin/gh/mlazos/68/base 2025-12-04T12:46:35.5023686Z * [new branch] gh/mlazos/68/head -> origin/gh/mlazos/68/head 2025-12-04T12:46:35.5023753Z * [new branch] gh/mlazos/68/orig -> origin/gh/mlazos/68/orig 2025-12-04T12:46:35.5023818Z * [new branch] gh/mlazos/69/base -> origin/gh/mlazos/69/base 2025-12-04T12:46:35.5023884Z * [new branch] gh/mlazos/69/head -> origin/gh/mlazos/69/head 2025-12-04T12:46:35.5023950Z * [new branch] gh/mlazos/69/orig -> origin/gh/mlazos/69/orig 2025-12-04T12:46:35.5024017Z * [new branch] gh/mlazos/70/base -> origin/gh/mlazos/70/base 2025-12-04T12:46:35.5024083Z * [new branch] gh/mlazos/70/head -> origin/gh/mlazos/70/head 2025-12-04T12:46:35.5024150Z * [new branch] gh/mlazos/70/orig -> origin/gh/mlazos/70/orig 2025-12-04T12:46:35.5024215Z * [new branch] gh/mlazos/71/base -> 
origin/gh/mlazos/71/base 2025-12-04T12:46:35.5024283Z * [new branch] gh/mlazos/71/head -> origin/gh/mlazos/71/head 2025-12-04T12:46:35.5024349Z * [new branch] gh/mlazos/71/orig -> origin/gh/mlazos/71/orig 2025-12-04T12:46:35.5024415Z * [new branch] gh/mlazos/72/base -> origin/gh/mlazos/72/base 2025-12-04T12:46:35.5024481Z * [new branch] gh/mlazos/72/head -> origin/gh/mlazos/72/head 2025-12-04T12:46:35.5024548Z * [new branch] gh/mlazos/72/orig -> origin/gh/mlazos/72/orig 2025-12-04T12:46:35.5024614Z * [new branch] gh/mlazos/73/base -> origin/gh/mlazos/73/base 2025-12-04T12:46:35.5024681Z * [new branch] gh/mlazos/73/head -> origin/gh/mlazos/73/head 2025-12-04T12:46:35.5024748Z * [new branch] gh/mlazos/73/orig -> origin/gh/mlazos/73/orig 2025-12-04T12:46:35.5024815Z * [new branch] gh/mrmiywj/1/base -> origin/gh/mrmiywj/1/base 2025-12-04T12:46:35.5024883Z * [new branch] gh/mrmiywj/1/head -> origin/gh/mrmiywj/1/head 2025-12-04T12:46:35.5024957Z * [new branch] gh/muchulee8/73/base -> origin/gh/muchulee8/73/base 2025-12-04T12:46:35.5025029Z * [new branch] gh/muchulee8/73/head -> origin/gh/muchulee8/73/head 2025-12-04T12:46:35.5025100Z * [new branch] gh/muchulee8/73/orig -> origin/gh/muchulee8/73/orig 2025-12-04T12:46:35.5025186Z * [new branch] gh/naveenthangudu/1/base -> origin/gh/naveenthangudu/1/base 2025-12-04T12:46:35.5025290Z * [new branch] gh/naveenthangudu/1/head -> origin/gh/naveenthangudu/1/head 2025-12-04T12:46:35.5025373Z * [new branch] gh/naveenthangudu/1/orig -> origin/gh/naveenthangudu/1/orig 2025-12-04T12:46:35.5025452Z * [new branch] gh/naveenthangudu/2/base -> origin/gh/naveenthangudu/2/base 2025-12-04T12:46:35.5025531Z * [new branch] gh/naveenthangudu/2/head -> origin/gh/naveenthangudu/2/head 2025-12-04T12:46:35.5025641Z * [new branch] gh/naveenthangudu/2/orig -> origin/gh/naveenthangudu/2/orig 2025-12-04T12:46:35.5025720Z * [new branch] gh/naveenthangudu/3/base -> origin/gh/naveenthangudu/3/base 2025-12-04T12:46:35.5025800Z * [new branch] 
gh/naveenthangudu/3/head -> origin/gh/naveenthangudu/3/head 2025-12-04T12:46:35.5025880Z * [new branch] gh/naveenthangudu/3/orig -> origin/gh/naveenthangudu/3/orig 2025-12-04T12:46:35.5025960Z * [new branch] gh/naveenthangudu/4/base -> origin/gh/naveenthangudu/4/base 2025-12-04T12:46:35.5026041Z * [new branch] gh/naveenthangudu/4/head -> origin/gh/naveenthangudu/4/head 2025-12-04T12:46:35.5026121Z * [new branch] gh/naveenthangudu/4/orig -> origin/gh/naveenthangudu/4/orig 2025-12-04T12:46:35.5026200Z * [new branch] gh/naveenthangudu/5/base -> origin/gh/naveenthangudu/5/base 2025-12-04T12:46:35.5026281Z * [new branch] gh/naveenthangudu/5/head -> origin/gh/naveenthangudu/5/head 2025-12-04T12:46:35.5026361Z * [new branch] gh/naveenthangudu/5/orig -> origin/gh/naveenthangudu/5/orig 2025-12-04T12:46:35.5026440Z * [new branch] gh/naveenthangudu/6/base -> origin/gh/naveenthangudu/6/base 2025-12-04T12:46:35.5026519Z * [new branch] gh/naveenthangudu/6/head -> origin/gh/naveenthangudu/6/head 2025-12-04T12:46:35.5026599Z * [new branch] gh/naveenthangudu/6/orig -> origin/gh/naveenthangudu/6/orig 2025-12-04T12:46:35.5026679Z * [new branch] gh/naveenthangudu/7/base -> origin/gh/naveenthangudu/7/base 2025-12-04T12:46:35.5026759Z * [new branch] gh/naveenthangudu/7/head -> origin/gh/naveenthangudu/7/head 2025-12-04T12:46:35.5026839Z * [new branch] gh/naveenthangudu/7/orig -> origin/gh/naveenthangudu/7/orig 2025-12-04T12:46:35.5026918Z * [new branch] gh/naveenthangudu/8/base -> origin/gh/naveenthangudu/8/base 2025-12-04T12:46:35.5027001Z * [new branch] gh/naveenthangudu/8/head -> origin/gh/naveenthangudu/8/head 2025-12-04T12:46:35.5027080Z * [new branch] gh/naveenthangudu/8/orig -> origin/gh/naveenthangudu/8/orig 2025-12-04T12:46:35.5027160Z * [new branch] gh/naveenthangudu/9/base -> origin/gh/naveenthangudu/9/base 2025-12-04T12:46:35.5027239Z * [new branch] gh/naveenthangudu/9/head -> origin/gh/naveenthangudu/9/head 2025-12-04T12:46:35.5027318Z * [new branch] 
gh/naveenthangudu/9/orig -> origin/gh/naveenthangudu/9/orig 2025-12-04T12:46:35.5027391Z * [new branch] gh/nikitaved/1/base -> origin/gh/nikitaved/1/base 2025-12-04T12:46:35.5027464Z * [new branch] gh/nikitaved/1/head -> origin/gh/nikitaved/1/head 2025-12-04T12:46:35.5027574Z * [new branch] gh/nikitaved/1/orig -> origin/gh/nikitaved/1/orig 2025-12-04T12:46:35.5027650Z * [new branch] gh/nikitaved/10/base -> origin/gh/nikitaved/10/base 2025-12-04T12:46:35.5027723Z * [new branch] gh/nikitaved/10/head -> origin/gh/nikitaved/10/head 2025-12-04T12:46:35.5027795Z * [new branch] gh/nikitaved/10/orig -> origin/gh/nikitaved/10/orig 2025-12-04T12:46:35.5027866Z * [new branch] gh/nikitaved/11/base -> origin/gh/nikitaved/11/base 2025-12-04T12:46:35.5027937Z * [new branch] gh/nikitaved/11/head -> origin/gh/nikitaved/11/head 2025-12-04T12:46:35.5028007Z * [new branch] gh/nikitaved/11/orig -> origin/gh/nikitaved/11/orig 2025-12-04T12:46:35.5028127Z * [new branch] gh/nikitaved/12/base -> origin/gh/nikitaved/12/base 2025-12-04T12:46:35.5028199Z * [new branch] gh/nikitaved/12/head -> origin/gh/nikitaved/12/head 2025-12-04T12:46:35.5028269Z * [new branch] gh/nikitaved/12/orig -> origin/gh/nikitaved/12/orig 2025-12-04T12:46:35.5028375Z * [new branch] gh/nikitaved/13/base -> origin/gh/nikitaved/13/base 2025-12-04T12:46:35.5028446Z * [new branch] gh/nikitaved/13/head -> origin/gh/nikitaved/13/head 2025-12-04T12:46:35.5028516Z * [new branch] gh/nikitaved/13/orig -> origin/gh/nikitaved/13/orig 2025-12-04T12:46:35.5028588Z * [new branch] gh/nikitaved/14/base -> origin/gh/nikitaved/14/base 2025-12-04T12:46:35.5028657Z * [new branch] gh/nikitaved/14/head -> origin/gh/nikitaved/14/head 2025-12-04T12:46:35.5028728Z * [new branch] gh/nikitaved/14/orig -> origin/gh/nikitaved/14/orig 2025-12-04T12:46:35.5028801Z * [new branch] gh/nikitaved/15/base -> origin/gh/nikitaved/15/base 2025-12-04T12:46:35.5028871Z * [new branch] gh/nikitaved/15/head -> origin/gh/nikitaved/15/head 
2025-12-04T12:46:35.5028943Z * [new branch] gh/nikitaved/15/orig -> origin/gh/nikitaved/15/orig 2025-12-04T12:46:35.5029017Z * [new branch] gh/nikitaved/16/base -> origin/gh/nikitaved/16/base 2025-12-04T12:46:35.5029087Z * [new branch] gh/nikitaved/16/head -> origin/gh/nikitaved/16/head 2025-12-04T12:46:35.5029158Z * [new branch] gh/nikitaved/16/orig -> origin/gh/nikitaved/16/orig 2025-12-04T12:46:35.5029229Z * [new branch] gh/nikitaved/2/base -> origin/gh/nikitaved/2/base 2025-12-04T12:46:35.5029300Z * [new branch] gh/nikitaved/2/head -> origin/gh/nikitaved/2/head 2025-12-04T12:46:35.5029371Z * [new branch] gh/nikitaved/2/orig -> origin/gh/nikitaved/2/orig 2025-12-04T12:46:35.5029443Z * [new branch] gh/nikitaved/4/base -> origin/gh/nikitaved/4/base 2025-12-04T12:46:35.5029513Z * [new branch] gh/nikitaved/4/head -> origin/gh/nikitaved/4/head 2025-12-04T12:46:35.5029582Z * [new branch] gh/nikitaved/4/orig -> origin/gh/nikitaved/4/orig 2025-12-04T12:46:35.5029654Z * [new branch] gh/nikitaved/5/base -> origin/gh/nikitaved/5/base 2025-12-04T12:46:35.5029725Z * [new branch] gh/nikitaved/5/head -> origin/gh/nikitaved/5/head 2025-12-04T12:46:35.5029795Z * [new branch] gh/nikitaved/5/orig -> origin/gh/nikitaved/5/orig 2025-12-04T12:46:35.5029864Z * [new branch] gh/nikitaved/6/base -> origin/gh/nikitaved/6/base 2025-12-04T12:46:35.5029933Z * [new branch] gh/nikitaved/6/head -> origin/gh/nikitaved/6/head 2025-12-04T12:46:35.5030002Z * [new branch] gh/nikitaved/6/orig -> origin/gh/nikitaved/6/orig 2025-12-04T12:46:35.5030074Z * [new branch] gh/nikitaved/8/base -> origin/gh/nikitaved/8/base 2025-12-04T12:46:35.5030143Z * [new branch] gh/nikitaved/8/head -> origin/gh/nikitaved/8/head 2025-12-04T12:46:35.5030214Z * [new branch] gh/nikitaved/8/orig -> origin/gh/nikitaved/8/orig 2025-12-04T12:46:35.5030286Z * [new branch] gh/nikitaved/9/base -> origin/gh/nikitaved/9/base 2025-12-04T12:46:35.5030356Z * [new branch] gh/nikitaved/9/head -> origin/gh/nikitaved/9/head 
2025-12-04T12:46:35.5030425Z * [new branch] gh/nikitaved/9/orig -> origin/gh/nikitaved/9/orig 2025-12-04T12:46:35.5030493Z * [new branch] gh/oulgen/10/base -> origin/gh/oulgen/10/base 2025-12-04T12:46:35.5030560Z * [new branch] gh/oulgen/10/head -> origin/gh/oulgen/10/head 2025-12-04T12:46:35.5030627Z * [new branch] gh/oulgen/10/orig -> origin/gh/oulgen/10/orig 2025-12-04T12:46:35.5030725Z * [new branch] gh/oulgen/11/base -> origin/gh/oulgen/11/base 2025-12-04T12:46:35.5030792Z * [new branch] gh/oulgen/11/head -> origin/gh/oulgen/11/head 2025-12-04T12:46:35.5030859Z * [new branch] gh/oulgen/11/orig -> origin/gh/oulgen/11/orig 2025-12-04T12:46:35.5030945Z * [new branch] gh/oulgen/12/base -> origin/gh/oulgen/12/base 2025-12-04T12:46:35.5031011Z * [new branch] gh/oulgen/12/head -> origin/gh/oulgen/12/head 2025-12-04T12:46:35.5031077Z * [new branch] gh/oulgen/12/orig -> origin/gh/oulgen/12/orig 2025-12-04T12:46:35.5031143Z * [new branch] gh/oulgen/13/base -> origin/gh/oulgen/13/base 2025-12-04T12:46:35.5031209Z * [new branch] gh/oulgen/13/head -> origin/gh/oulgen/13/head 2025-12-04T12:46:35.5031276Z * [new branch] gh/oulgen/13/orig -> origin/gh/oulgen/13/orig 2025-12-04T12:46:35.5031342Z * [new branch] gh/oulgen/14/base -> origin/gh/oulgen/14/base 2025-12-04T12:46:35.5031408Z * [new branch] gh/oulgen/14/head -> origin/gh/oulgen/14/head 2025-12-04T12:46:35.5031475Z * [new branch] gh/oulgen/14/orig -> origin/gh/oulgen/14/orig 2025-12-04T12:46:35.5031542Z * [new branch] gh/oulgen/15/base -> origin/gh/oulgen/15/base 2025-12-04T12:46:35.5031608Z * [new branch] gh/oulgen/15/head -> origin/gh/oulgen/15/head 2025-12-04T12:46:35.5031675Z * [new branch] gh/oulgen/15/orig -> origin/gh/oulgen/15/orig 2025-12-04T12:46:35.5031740Z * [new branch] gh/oulgen/16/base -> origin/gh/oulgen/16/base 2025-12-04T12:46:35.5031807Z * [new branch] gh/oulgen/16/head -> origin/gh/oulgen/16/head 2025-12-04T12:46:35.5031873Z * [new branch] gh/oulgen/16/orig -> origin/gh/oulgen/16/orig 
2025-12-04T12:46:35.5031941Z * [new branch] gh/oulgen/17/base -> origin/gh/oulgen/17/base 2025-12-04T12:46:35.5032009Z * [new branch] gh/oulgen/17/head -> origin/gh/oulgen/17/head 2025-12-04T12:46:35.5032075Z * [new branch] gh/oulgen/17/orig -> origin/gh/oulgen/17/orig 2025-12-04T12:46:35.5032140Z * [new branch] gh/oulgen/18/base -> origin/gh/oulgen/18/base 2025-12-04T12:46:35.5032208Z * [new branch] gh/oulgen/18/head -> origin/gh/oulgen/18/head 2025-12-04T12:46:35.5032274Z * [new branch] gh/oulgen/18/orig -> origin/gh/oulgen/18/orig 2025-12-04T12:46:35.5032341Z * [new branch] gh/oulgen/19/base -> origin/gh/oulgen/19/base 2025-12-04T12:46:35.5032408Z * [new branch] gh/oulgen/19/head -> origin/gh/oulgen/19/head 2025-12-04T12:46:35.5032474Z * [new branch] gh/oulgen/19/orig -> origin/gh/oulgen/19/orig 2025-12-04T12:46:35.5032541Z * [new branch] gh/oulgen/20/base -> origin/gh/oulgen/20/base 2025-12-04T12:46:35.5032607Z * [new branch] gh/oulgen/20/head -> origin/gh/oulgen/20/head 2025-12-04T12:46:35.5032673Z * [new branch] gh/oulgen/20/orig -> origin/gh/oulgen/20/orig 2025-12-04T12:46:35.5032738Z * [new branch] gh/oulgen/21/base -> origin/gh/oulgen/21/base 2025-12-04T12:46:35.5032807Z * [new branch] gh/oulgen/21/head -> origin/gh/oulgen/21/head 2025-12-04T12:46:35.5032872Z * [new branch] gh/oulgen/21/orig -> origin/gh/oulgen/21/orig 2025-12-04T12:46:35.5032937Z * [new branch] gh/oulgen/22/base -> origin/gh/oulgen/22/base 2025-12-04T12:46:35.5033004Z * [new branch] gh/oulgen/22/head -> origin/gh/oulgen/22/head 2025-12-04T12:46:35.5033070Z * [new branch] gh/oulgen/22/orig -> origin/gh/oulgen/22/orig 2025-12-04T12:46:35.5033160Z * [new branch] gh/oulgen/23/base -> origin/gh/oulgen/23/base 2025-12-04T12:46:35.5033228Z * [new branch] gh/oulgen/23/head -> origin/gh/oulgen/23/head 2025-12-04T12:46:35.5033293Z * [new branch] gh/oulgen/23/orig -> origin/gh/oulgen/23/orig 2025-12-04T12:46:35.5033360Z * [new branch] gh/oulgen/24/base -> origin/gh/oulgen/24/base 
2025-12-04T12:46:35.5033452Z * [new branch] gh/oulgen/24/head -> origin/gh/oulgen/24/head 2025-12-04T12:46:35.5033518Z * [new branch] gh/oulgen/24/orig -> origin/gh/oulgen/24/orig 2025-12-04T12:46:35.5033584Z * [new branch] gh/oulgen/25/base -> origin/gh/oulgen/25/base 2025-12-04T12:46:35.5033650Z * [new branch] gh/oulgen/25/head -> origin/gh/oulgen/25/head 2025-12-04T12:46:35.5033715Z * [new branch] gh/oulgen/25/orig -> origin/gh/oulgen/25/orig 2025-12-04T12:46:35.5033783Z * [new branch] gh/oulgen/26/base -> origin/gh/oulgen/26/base 2025-12-04T12:46:35.5033849Z * [new branch] gh/oulgen/26/head -> origin/gh/oulgen/26/head 2025-12-04T12:46:35.5033915Z * [new branch] gh/oulgen/26/orig -> origin/gh/oulgen/26/orig 2025-12-04T12:46:35.5033982Z * [new branch] gh/oulgen/4/base -> origin/gh/oulgen/4/base 2025-12-04T12:46:35.5034051Z * [new branch] gh/oulgen/4/head -> origin/gh/oulgen/4/head 2025-12-04T12:46:35.5034117Z * [new branch] gh/oulgen/4/orig -> origin/gh/oulgen/4/orig 2025-12-04T12:46:35.5034182Z * [new branch] gh/oulgen/7/base -> origin/gh/oulgen/7/base 2025-12-04T12:46:35.5034248Z * [new branch] gh/oulgen/7/head -> origin/gh/oulgen/7/head 2025-12-04T12:46:35.5034312Z * [new branch] gh/oulgen/7/orig -> origin/gh/oulgen/7/orig 2025-12-04T12:46:35.5034381Z * [new branch] gh/oulgen/8/base -> origin/gh/oulgen/8/base 2025-12-04T12:46:35.5034447Z * [new branch] gh/oulgen/8/head -> origin/gh/oulgen/8/head 2025-12-04T12:46:35.5034511Z * [new branch] gh/oulgen/8/orig -> origin/gh/oulgen/8/orig 2025-12-04T12:46:35.5034576Z * [new branch] gh/oulgen/9/base -> origin/gh/oulgen/9/base 2025-12-04T12:46:35.5034641Z * [new branch] gh/oulgen/9/head -> origin/gh/oulgen/9/head 2025-12-04T12:46:35.5034706Z * [new branch] gh/oulgen/9/orig -> origin/gh/oulgen/9/orig 2025-12-04T12:46:35.5034808Z * [new branch] gh/patvig/mtia-serialization -> origin/gh/patvig/mtia-serialization 2025-12-04T12:46:35.5034876Z * [new branch] gh/pearu/108/base -> origin/gh/pearu/108/base 
2025-12-04T12:46:35.5034943Z * [new branch] gh/pearu/108/head -> origin/gh/pearu/108/head 2025-12-04T12:46:35.5035011Z * [new branch] gh/pearu/108/orig -> origin/gh/pearu/108/orig 2025-12-04T12:46:35.5035079Z * [new branch] gh/pearu/109/base -> origin/gh/pearu/109/base 2025-12-04T12:46:35.5035147Z * [new branch] gh/pearu/109/head -> origin/gh/pearu/109/head 2025-12-04T12:46:35.5035212Z * [new branch] gh/pearu/109/orig -> origin/gh/pearu/109/orig 2025-12-04T12:46:35.5035280Z * [new branch] gh/pearu/110/base -> origin/gh/pearu/110/base 2025-12-04T12:46:35.5035347Z * [new branch] gh/pearu/110/head -> origin/gh/pearu/110/head 2025-12-04T12:46:35.5035413Z * [new branch] gh/pearu/110/orig -> origin/gh/pearu/110/orig 2025-12-04T12:46:35.5035479Z * [new branch] gh/pearu/111/base -> origin/gh/pearu/111/base 2025-12-04T12:46:35.5035546Z * [new branch] gh/pearu/111/head -> origin/gh/pearu/111/head 2025-12-04T12:46:35.5035612Z * [new branch] gh/pearu/111/orig -> origin/gh/pearu/111/orig 2025-12-04T12:46:35.5035702Z * [new branch] gh/pearu/112/base -> origin/gh/pearu/112/base 2025-12-04T12:46:35.5035771Z * [new branch] gh/pearu/112/head -> origin/gh/pearu/112/head 2025-12-04T12:46:35.5035838Z * [new branch] gh/pearu/112/orig -> origin/gh/pearu/112/orig 2025-12-04T12:46:35.5035932Z * [new branch] gh/pearu/115/base -> origin/gh/pearu/115/base 2025-12-04T12:46:35.5036004Z * [new branch] gh/pearu/115/head -> origin/gh/pearu/115/head 2025-12-04T12:46:35.5036072Z * [new branch] gh/pearu/115/orig -> origin/gh/pearu/115/orig 2025-12-04T12:46:35.5036139Z * [new branch] gh/pearu/116/base -> origin/gh/pearu/116/base 2025-12-04T12:46:35.5036208Z * [new branch] gh/pearu/116/head -> origin/gh/pearu/116/head 2025-12-04T12:46:35.5036277Z * [new branch] gh/pearu/116/orig -> origin/gh/pearu/116/orig 2025-12-04T12:46:35.5036349Z * [new branch] gh/pearu/117/base -> origin/gh/pearu/117/base 2025-12-04T12:46:35.5036420Z * [new branch] gh/pearu/117/head -> origin/gh/pearu/117/head 
2025-12-04T12:46:35.5036488Z * [new branch] gh/pearu/117/orig -> origin/gh/pearu/117/orig 2025-12-04T12:46:35.5036556Z * [new branch] gh/pearu/118/base -> origin/gh/pearu/118/base 2025-12-04T12:46:35.5036628Z * [new branch] gh/pearu/118/head -> origin/gh/pearu/118/head 2025-12-04T12:46:35.5036694Z * [new branch] gh/pearu/118/orig -> origin/gh/pearu/118/orig 2025-12-04T12:46:35.5036763Z * [new branch] gh/pearu/119/base -> origin/gh/pearu/119/base 2025-12-04T12:46:35.5036832Z * [new branch] gh/pearu/119/head -> origin/gh/pearu/119/head 2025-12-04T12:46:35.5036901Z * [new branch] gh/pearu/119/orig -> origin/gh/pearu/119/orig 2025-12-04T12:46:35.5036974Z * [new branch] gh/pearu/139/base -> origin/gh/pearu/139/base 2025-12-04T12:46:35.5037042Z * [new branch] gh/pearu/139/head -> origin/gh/pearu/139/head 2025-12-04T12:46:35.5037109Z * [new branch] gh/pearu/139/orig -> origin/gh/pearu/139/orig 2025-12-04T12:46:35.5037178Z * [new branch] gh/pearu/140/base -> origin/gh/pearu/140/base 2025-12-04T12:46:35.5037248Z * [new branch] gh/pearu/140/head -> origin/gh/pearu/140/head 2025-12-04T12:46:35.5037316Z * [new branch] gh/pearu/140/orig -> origin/gh/pearu/140/orig 2025-12-04T12:46:35.5037388Z * [new branch] gh/pearu/142/base -> origin/gh/pearu/142/base 2025-12-04T12:46:35.5037455Z * [new branch] gh/pearu/142/head -> origin/gh/pearu/142/head 2025-12-04T12:46:35.5037563Z * [new branch] gh/pearu/142/orig -> origin/gh/pearu/142/orig 2025-12-04T12:46:35.5037636Z * [new branch] gh/pearu/143/base -> origin/gh/pearu/143/base 2025-12-04T12:46:35.5043663Z * [new branch] gh/pearu/143/head -> origin/gh/pearu/143/head 2025-12-04T12:46:35.5043732Z * [new branch] gh/pearu/143/orig -> origin/gh/pearu/143/orig 2025-12-04T12:46:35.5043799Z * [new branch] gh/pearu/147/base -> origin/gh/pearu/147/base 2025-12-04T12:46:35.5043871Z * [new branch] gh/pearu/147/head -> origin/gh/pearu/147/head 2025-12-04T12:46:35.5043939Z * [new branch] gh/pearu/147/orig -> origin/gh/pearu/147/orig 
2025-12-04T12:46:35.5044010Z * [new branch] gh/pearu/149/base -> origin/gh/pearu/149/base 2025-12-04T12:46:35.5044078Z * [new branch] gh/pearu/149/head -> origin/gh/pearu/149/head 2025-12-04T12:46:35.5044145Z * [new branch] gh/pearu/149/orig -> origin/gh/pearu/149/orig 2025-12-04T12:46:35.5044277Z * [new branch] gh/pearu/150/base -> origin/gh/pearu/150/base 2025-12-04T12:46:35.5044345Z * [new branch] gh/pearu/150/head -> origin/gh/pearu/150/head 2025-12-04T12:46:35.5044411Z * [new branch] gh/pearu/150/orig -> origin/gh/pearu/150/orig 2025-12-04T12:46:35.5044480Z * [new branch] gh/pearu/151/base -> origin/gh/pearu/151/base 2025-12-04T12:46:35.5044607Z * [new branch] gh/pearu/151/head -> origin/gh/pearu/151/head 2025-12-04T12:46:35.5044674Z * [new branch] gh/pearu/151/orig -> origin/gh/pearu/151/orig 2025-12-04T12:46:35.5044743Z * [new branch] gh/pearu/152/base -> origin/gh/pearu/152/base 2025-12-04T12:46:35.5044809Z * [new branch] gh/pearu/152/head -> origin/gh/pearu/152/head 2025-12-04T12:46:35.5044875Z * [new branch] gh/pearu/152/orig -> origin/gh/pearu/152/orig 2025-12-04T12:46:35.5044946Z * [new branch] gh/pearu/153/base -> origin/gh/pearu/153/base 2025-12-04T12:46:35.5045014Z * [new branch] gh/pearu/153/head -> origin/gh/pearu/153/head 2025-12-04T12:46:35.5045080Z * [new branch] gh/pearu/153/orig -> origin/gh/pearu/153/orig 2025-12-04T12:46:35.5045147Z * [new branch] gh/pearu/154/base -> origin/gh/pearu/154/base 2025-12-04T12:46:35.5045217Z * [new branch] gh/pearu/154/head -> origin/gh/pearu/154/head 2025-12-04T12:46:35.5045283Z * [new branch] gh/pearu/154/orig -> origin/gh/pearu/154/orig 2025-12-04T12:46:35.5045351Z * [new branch] gh/pearu/155/base -> origin/gh/pearu/155/base 2025-12-04T12:46:35.5045417Z * [new branch] gh/pearu/155/head -> origin/gh/pearu/155/head 2025-12-04T12:46:35.5045485Z * [new branch] gh/pearu/155/orig -> origin/gh/pearu/155/orig 2025-12-04T12:46:35.5045552Z * [new branch] gh/pearu/156/base -> origin/gh/pearu/156/base 
2025-12-04T12:46:35.5045622Z * [new branch] gh/pearu/156/head -> origin/gh/pearu/156/head 2025-12-04T12:46:35.5045689Z * [new branch] gh/pearu/156/orig -> origin/gh/pearu/156/orig 2025-12-04T12:46:35.5045756Z * [new branch] gh/pearu/56/base -> origin/gh/pearu/56/base 2025-12-04T12:46:35.5045824Z * [new branch] gh/pearu/56/head -> origin/gh/pearu/56/head 2025-12-04T12:46:35.5045892Z * [new branch] gh/pearu/56/orig -> origin/gh/pearu/56/orig 2025-12-04T12:46:35.5045958Z * [new branch] gh/pearu/97/base -> origin/gh/pearu/97/base 2025-12-04T12:46:35.5046022Z * [new branch] gh/pearu/97/head -> origin/gh/pearu/97/head 2025-12-04T12:46:35.5046089Z * [new branch] gh/pearu/97/orig -> origin/gh/pearu/97/orig 2025-12-04T12:46:35.5046165Z * [new branch] gh/pianpwk/21/base -> origin/gh/pianpwk/21/base 2025-12-04T12:46:35.5046238Z * [new branch] gh/pianpwk/21/head -> origin/gh/pianpwk/21/head 2025-12-04T12:46:35.5046309Z * [new branch] gh/pianpwk/28/base -> origin/gh/pianpwk/28/base 2025-12-04T12:46:35.5046378Z * [new branch] gh/pianpwk/28/head -> origin/gh/pianpwk/28/head 2025-12-04T12:46:35.5046448Z * [new branch] gh/pianpwk/28/orig -> origin/gh/pianpwk/28/orig 2025-12-04T12:46:35.5046518Z * [new branch] gh/pianpwk/29/base -> origin/gh/pianpwk/29/base 2025-12-04T12:46:35.5046587Z * [new branch] gh/pianpwk/29/head -> origin/gh/pianpwk/29/head 2025-12-04T12:46:35.5046655Z * [new branch] gh/pianpwk/29/orig -> origin/gh/pianpwk/29/orig 2025-12-04T12:46:35.5046724Z * [new branch] gh/pianpwk/30/base -> origin/gh/pianpwk/30/base 2025-12-04T12:46:35.5046793Z * [new branch] gh/pianpwk/30/head -> origin/gh/pianpwk/30/head 2025-12-04T12:46:35.5046888Z * [new branch] gh/pianpwk/30/orig -> origin/gh/pianpwk/30/orig 2025-12-04T12:46:35.5046961Z * [new branch] gh/pianpwk/31/base -> origin/gh/pianpwk/31/base 2025-12-04T12:46:35.5047030Z * [new branch] gh/pianpwk/31/head -> origin/gh/pianpwk/31/head 2025-12-04T12:46:35.5047121Z * [new branch] gh/pianpwk/31/orig -> origin/gh/pianpwk/31/orig 
2025-12-04T12:46:35.5047190Z * [new branch] gh/pianpwk/32/base -> origin/gh/pianpwk/32/base 2025-12-04T12:46:35.5047260Z * [new branch] gh/pianpwk/32/head -> origin/gh/pianpwk/32/head 2025-12-04T12:46:35.5047329Z * [new branch] gh/pianpwk/32/orig -> origin/gh/pianpwk/32/orig 2025-12-04T12:46:35.5047398Z * [new branch] gh/pianpwk/33/base -> origin/gh/pianpwk/33/base 2025-12-04T12:46:35.5047467Z * [new branch] gh/pianpwk/33/head -> origin/gh/pianpwk/33/head 2025-12-04T12:46:35.5047572Z * [new branch] gh/pianpwk/33/orig -> origin/gh/pianpwk/33/orig 2025-12-04T12:46:35.5047643Z * [new branch] gh/pianpwk/34/base -> origin/gh/pianpwk/34/base 2025-12-04T12:46:35.5047712Z * [new branch] gh/pianpwk/34/head -> origin/gh/pianpwk/34/head 2025-12-04T12:46:35.5047782Z * [new branch] gh/pianpwk/34/orig -> origin/gh/pianpwk/34/orig 2025-12-04T12:46:35.5047852Z * [new branch] gh/pianpwk/35/base -> origin/gh/pianpwk/35/base 2025-12-04T12:46:35.5047922Z * [new branch] gh/pianpwk/35/head -> origin/gh/pianpwk/35/head 2025-12-04T12:46:35.5047991Z * [new branch] gh/pianpwk/35/orig -> origin/gh/pianpwk/35/orig 2025-12-04T12:46:35.5048057Z * [new branch] gh/rec/141/base -> origin/gh/rec/141/base 2025-12-04T12:46:35.5048121Z * [new branch] gh/rec/141/head -> origin/gh/rec/141/head 2025-12-04T12:46:35.5048188Z * [new branch] gh/rec/153/base -> origin/gh/rec/153/base 2025-12-04T12:46:35.5048250Z * [new branch] gh/rec/153/head -> origin/gh/rec/153/head 2025-12-04T12:46:35.5048314Z * [new branch] gh/rec/153/orig -> origin/gh/rec/153/orig 2025-12-04T12:46:35.5048378Z * [new branch] gh/rec/154/base -> origin/gh/rec/154/base 2025-12-04T12:46:35.5048442Z * [new branch] gh/rec/154/head -> origin/gh/rec/154/head 2025-12-04T12:46:35.5048505Z * [new branch] gh/rec/154/orig -> origin/gh/rec/154/orig 2025-12-04T12:46:35.5048573Z * [new branch] gh/rec/164/base -> origin/gh/rec/164/base 2025-12-04T12:46:35.5048635Z * [new branch] gh/rec/164/head -> origin/gh/rec/164/head 2025-12-04T12:46:35.5048698Z * [new 
branch] gh/rec/164/orig -> origin/gh/rec/164/orig 2025-12-04T12:46:35.5048762Z * [new branch] gh/rec/166/base -> origin/gh/rec/166/base 2025-12-04T12:46:35.5048824Z * [new branch] gh/rec/166/head -> origin/gh/rec/166/head 2025-12-04T12:46:35.5048887Z * [new branch] gh/rec/166/orig -> origin/gh/rec/166/orig 2025-12-04T12:46:35.5048950Z * [new branch] gh/rec/167/base -> origin/gh/rec/167/base 2025-12-04T12:46:35.5049016Z * [new branch] gh/rec/167/head -> origin/gh/rec/167/head 2025-12-04T12:46:35.5049081Z * [new branch] gh/rec/167/orig -> origin/gh/rec/167/orig 2025-12-04T12:46:35.5049142Z * [new branch] gh/rec/168/base -> origin/gh/rec/168/base 2025-12-04T12:46:35.5049204Z * [new branch] gh/rec/168/head -> origin/gh/rec/168/head 2025-12-04T12:46:35.5049268Z * [new branch] gh/rec/168/orig -> origin/gh/rec/168/orig 2025-12-04T12:46:35.5049330Z * [new branch] gh/rec/169/base -> origin/gh/rec/169/base 2025-12-04T12:46:35.5049437Z * [new branch] gh/rec/169/head -> origin/gh/rec/169/head 2025-12-04T12:46:35.5049501Z * [new branch] gh/rec/169/orig -> origin/gh/rec/169/orig 2025-12-04T12:46:35.5049563Z * [new branch] gh/rec/170/base -> origin/gh/rec/170/base 2025-12-04T12:46:35.5049668Z * [new branch] gh/rec/170/head -> origin/gh/rec/170/head 2025-12-04T12:46:35.5049732Z * [new branch] gh/rec/170/orig -> origin/gh/rec/170/orig 2025-12-04T12:46:35.5049794Z * [new branch] gh/rec/171/base -> origin/gh/rec/171/base 2025-12-04T12:46:35.5049856Z * [new branch] gh/rec/171/head -> origin/gh/rec/171/head 2025-12-04T12:46:35.5049920Z * [new branch] gh/rec/171/orig -> origin/gh/rec/171/orig 2025-12-04T12:46:35.5049982Z * [new branch] gh/rec/172/base -> origin/gh/rec/172/base 2025-12-04T12:46:35.5050047Z * [new branch] gh/rec/172/head -> origin/gh/rec/172/head 2025-12-04T12:46:35.5050113Z * [new branch] gh/rec/172/orig -> origin/gh/rec/172/orig 2025-12-04T12:46:35.5050174Z * [new branch] gh/rec/173/base -> origin/gh/rec/173/base 2025-12-04T12:46:35.5050237Z * [new branch] 
gh/rec/173/head -> origin/gh/rec/173/head 2025-12-04T12:46:35.5050304Z * [new branch] gh/rec/173/orig -> origin/gh/rec/173/orig 2025-12-04T12:46:35.5050366Z * [new branch] gh/rec/174/base -> origin/gh/rec/174/base 2025-12-04T12:46:35.5050429Z * [new branch] gh/rec/174/head -> origin/gh/rec/174/head 2025-12-04T12:46:35.5050492Z * [new branch] gh/rec/174/orig -> origin/gh/rec/174/orig 2025-12-04T12:46:35.5050554Z * [new branch] gh/rec/175/base -> origin/gh/rec/175/base 2025-12-04T12:46:35.5050616Z * [new branch] gh/rec/175/head -> origin/gh/rec/175/head 2025-12-04T12:46:35.5050680Z * [new branch] gh/rec/175/orig -> origin/gh/rec/175/orig 2025-12-04T12:46:35.5050741Z * [new branch] gh/rec/176/base -> origin/gh/rec/176/base 2025-12-04T12:46:35.5050804Z * [new branch] gh/rec/176/head -> origin/gh/rec/176/head 2025-12-04T12:46:35.5050868Z * [new branch] gh/rec/176/orig -> origin/gh/rec/176/orig 2025-12-04T12:46:35.5050930Z * [new branch] gh/rec/177/base -> origin/gh/rec/177/base 2025-12-04T12:46:35.5050993Z * [new branch] gh/rec/177/head -> origin/gh/rec/177/head 2025-12-04T12:46:35.5051055Z * [new branch] gh/rec/177/orig -> origin/gh/rec/177/orig 2025-12-04T12:46:35.5051118Z * [new branch] gh/rec/178/base -> origin/gh/rec/178/base 2025-12-04T12:46:35.5051183Z * [new branch] gh/rec/178/head -> origin/gh/rec/178/head 2025-12-04T12:46:35.5051249Z * [new branch] gh/rec/178/orig -> origin/gh/rec/178/orig 2025-12-04T12:46:35.5051341Z * [new branch] gh/robert-hardwick/3/base -> origin/gh/robert-hardwick/3/base 2025-12-04T12:46:35.5051429Z * [new branch] gh/robert-hardwick/3/head -> origin/gh/robert-hardwick/3/head 2025-12-04T12:46:35.5051513Z * [new branch] gh/robert-hardwick/3/orig -> origin/gh/robert-hardwick/3/orig 2025-12-04T12:46:35.5051596Z * [new branch] gh/robert-hardwick/4/base -> origin/gh/robert-hardwick/4/base 2025-12-04T12:46:35.5051678Z * [new branch] gh/robert-hardwick/4/head -> origin/gh/robert-hardwick/4/head 2025-12-04T12:46:35.5051760Z * [new branch] 
gh/robert-hardwick/4/orig -> origin/gh/robert-hardwick/4/orig 2025-12-04T12:46:35.5051842Z * [new branch] gh/robert-hardwick/5/base -> origin/gh/robert-hardwick/5/base 2025-12-04T12:46:35.5051951Z * [new branch] gh/robert-hardwick/5/head -> origin/gh/robert-hardwick/5/head 2025-12-04T12:46:35.5052034Z * [new branch] gh/robert-hardwick/5/orig -> origin/gh/robert-hardwick/5/orig 2025-12-04T12:46:35.5052116Z * [new branch] gh/robert-hardwick/6/base -> origin/gh/robert-hardwick/6/base 2025-12-04T12:46:35.5052221Z * [new branch] gh/robert-hardwick/6/head -> origin/gh/robert-hardwick/6/head 2025-12-04T12:46:35.5052303Z * [new branch] gh/robert-hardwick/6/orig -> origin/gh/robert-hardwick/6/orig 2025-12-04T12:46:35.5052388Z * [new branch] gh/robert-hardwick/7/base -> origin/gh/robert-hardwick/7/base 2025-12-04T12:46:35.5052469Z * [new branch] gh/robert-hardwick/7/head -> origin/gh/robert-hardwick/7/head 2025-12-04T12:46:35.5052552Z * [new branch] gh/robert-hardwick/7/orig -> origin/gh/robert-hardwick/7/orig 2025-12-04T12:46:35.5052635Z * [new branch] gh/robert-hardwick/8/base -> origin/gh/robert-hardwick/8/base 2025-12-04T12:46:35.5052718Z * [new branch] gh/robert-hardwick/8/head -> origin/gh/robert-hardwick/8/head 2025-12-04T12:46:35.5052800Z * [new branch] gh/robert-hardwick/8/orig -> origin/gh/robert-hardwick/8/orig 2025-12-04T12:46:35.5052883Z * [new branch] gh/robert-hardwick/9/base -> origin/gh/robert-hardwick/9/base 2025-12-04T12:46:35.5052966Z * [new branch] gh/robert-hardwick/9/head -> origin/gh/robert-hardwick/9/head 2025-12-04T12:46:35.5053048Z * [new branch] gh/robert-hardwick/9/orig -> origin/gh/robert-hardwick/9/orig 2025-12-04T12:46:35.5053119Z * [new branch] gh/rtimpe/1/base -> origin/gh/rtimpe/1/base 2025-12-04T12:46:35.5053190Z * [new branch] gh/rtimpe/1/head -> origin/gh/rtimpe/1/head 2025-12-04T12:46:35.5053256Z * [new branch] gh/rtimpe/2/base -> origin/gh/rtimpe/2/base 2025-12-04T12:46:35.5053326Z * [new branch] gh/rtimpe/2/head -> 
origin/gh/rtimpe/2/head 2025-12-04T12:46:35.5053394Z * [new branch] gh/rtimpe/22/base -> origin/gh/rtimpe/22/base 2025-12-04T12:46:35.5053462Z * [new branch] gh/rtimpe/22/head -> origin/gh/rtimpe/22/head 2025-12-04T12:46:35.5053531Z * [new branch] gh/rtimpe/22/orig -> origin/gh/rtimpe/22/orig 2025-12-04T12:46:35.5053601Z * [new branch] gh/rtimpe/23/base -> origin/gh/rtimpe/23/base 2025-12-04T12:46:35.5053787Z * [new branch] gh/rtimpe/23/head -> origin/gh/rtimpe/23/head 2025-12-04T12:46:35.5053856Z * [new branch] gh/rtimpe/23/orig -> origin/gh/rtimpe/23/orig 2025-12-04T12:46:35.5053923Z * [new branch] gh/rtimpe/24/base -> origin/gh/rtimpe/24/base 2025-12-04T12:46:35.5053989Z * [new branch] gh/rtimpe/24/head -> origin/gh/rtimpe/24/head 2025-12-04T12:46:35.5054058Z * [new branch] gh/rtimpe/24/orig -> origin/gh/rtimpe/24/orig 2025-12-04T12:46:35.5054124Z * [new branch] gh/rtimpe/25/base -> origin/gh/rtimpe/25/base 2025-12-04T12:46:35.5054193Z * [new branch] gh/rtimpe/25/head -> origin/gh/rtimpe/25/head 2025-12-04T12:46:35.5054259Z * [new branch] gh/rtimpe/25/orig -> origin/gh/rtimpe/25/orig 2025-12-04T12:46:35.5054327Z * [new branch] gh/rtimpe/26/base -> origin/gh/rtimpe/26/base 2025-12-04T12:46:35.5054395Z * [new branch] gh/rtimpe/26/head -> origin/gh/rtimpe/26/head 2025-12-04T12:46:35.5054460Z * [new branch] gh/rtimpe/26/orig -> origin/gh/rtimpe/26/orig 2025-12-04T12:46:35.5054526Z * [new branch] gh/rtimpe/27/base -> origin/gh/rtimpe/27/base 2025-12-04T12:46:35.5054594Z * [new branch] gh/rtimpe/27/head -> origin/gh/rtimpe/27/head 2025-12-04T12:46:35.5054661Z * [new branch] gh/rtimpe/27/orig -> origin/gh/rtimpe/27/orig 2025-12-04T12:46:35.5054752Z * [new branch] gh/rtimpe/28/base -> origin/gh/rtimpe/28/base 2025-12-04T12:46:35.5054820Z * [new branch] gh/rtimpe/28/head -> origin/gh/rtimpe/28/head 2025-12-04T12:46:35.5054886Z * [new branch] gh/rtimpe/28/orig -> origin/gh/rtimpe/28/orig 2025-12-04T12:46:35.5054988Z * [new branch] gh/rtimpe/29/base -> 
origin/gh/rtimpe/29/base 2025-12-04T12:46:35.5055055Z * [new branch] gh/rtimpe/29/head -> origin/gh/rtimpe/29/head 2025-12-04T12:46:35.5055121Z * [new branch] gh/rtimpe/29/orig -> origin/gh/rtimpe/29/orig 2025-12-04T12:46:35.5055188Z * [new branch] gh/rtimpe/3/base -> origin/gh/rtimpe/3/base 2025-12-04T12:46:35.5055254Z * [new branch] gh/rtimpe/3/head -> origin/gh/rtimpe/3/head 2025-12-04T12:46:35.5055320Z * [new branch] gh/rtimpe/30/base -> origin/gh/rtimpe/30/base 2025-12-04T12:46:35.5055388Z * [new branch] gh/rtimpe/30/head -> origin/gh/rtimpe/30/head 2025-12-04T12:46:35.5055455Z * [new branch] gh/rtimpe/30/orig -> origin/gh/rtimpe/30/orig 2025-12-04T12:46:35.5055520Z * [new branch] gh/rtimpe/31/base -> origin/gh/rtimpe/31/base 2025-12-04T12:46:35.5055591Z * [new branch] gh/rtimpe/31/head -> origin/gh/rtimpe/31/head 2025-12-04T12:46:35.5055657Z * [new branch] gh/rtimpe/31/orig -> origin/gh/rtimpe/31/orig 2025-12-04T12:46:35.5055723Z * [new branch] gh/rtimpe/32/base -> origin/gh/rtimpe/32/base 2025-12-04T12:46:35.5055791Z * [new branch] gh/rtimpe/32/head -> origin/gh/rtimpe/32/head 2025-12-04T12:46:35.5055856Z * [new branch] gh/rtimpe/32/orig -> origin/gh/rtimpe/32/orig 2025-12-04T12:46:35.5055922Z * [new branch] gh/rtimpe/33/base -> origin/gh/rtimpe/33/base 2025-12-04T12:46:35.5055991Z * [new branch] gh/rtimpe/33/head -> origin/gh/rtimpe/33/head 2025-12-04T12:46:35.5056057Z * [new branch] gh/rtimpe/33/orig -> origin/gh/rtimpe/33/orig 2025-12-04T12:46:35.5056122Z * [new branch] gh/rtimpe/34/base -> origin/gh/rtimpe/34/base 2025-12-04T12:46:35.5056190Z * [new branch] gh/rtimpe/34/head -> origin/gh/rtimpe/34/head 2025-12-04T12:46:35.5056256Z * [new branch] gh/rtimpe/34/orig -> origin/gh/rtimpe/34/orig 2025-12-04T12:46:35.5056322Z * [new branch] gh/rtimpe/35/base -> origin/gh/rtimpe/35/base 2025-12-04T12:46:35.5056388Z * [new branch] gh/rtimpe/35/head -> origin/gh/rtimpe/35/head 2025-12-04T12:46:35.5056454Z * [new branch] gh/rtimpe/35/orig -> 
origin/gh/rtimpe/35/orig 2025-12-04T12:46:35.5056520Z * [new branch] gh/rtimpe/4/base -> origin/gh/rtimpe/4/base 2025-12-04T12:46:35.5056589Z * [new branch] gh/rtimpe/4/head -> origin/gh/rtimpe/4/head 2025-12-04T12:46:35.5056671Z * [new branch] gh/ruisizhang123/1/base -> origin/gh/ruisizhang123/1/base 2025-12-04T12:46:35.5056750Z * [new branch] gh/ruisizhang123/1/head -> origin/gh/ruisizhang123/1/head 2025-12-04T12:46:35.5056831Z * [new branch] gh/ruisizhang123/1/orig -> origin/gh/ruisizhang123/1/orig 2025-12-04T12:46:35.5056907Z * [new branch] gh/ruisizhang123/4/base -> origin/gh/ruisizhang123/4/base 2025-12-04T12:46:35.5056982Z * [new branch] gh/ruisizhang123/4/head -> origin/gh/ruisizhang123/4/head 2025-12-04T12:46:35.5057059Z * [new branch] gh/ruisizhang123/4/orig -> origin/gh/ruisizhang123/4/orig 2025-12-04T12:46:35.5057133Z * [new branch] gh/ruisizhang123/5/base -> origin/gh/ruisizhang123/5/base 2025-12-04T12:46:35.5057209Z * [new branch] gh/ruisizhang123/5/head -> origin/gh/ruisizhang123/5/head 2025-12-04T12:46:35.5057308Z * [new branch] gh/ruisizhang123/5/orig -> origin/gh/ruisizhang123/5/orig 2025-12-04T12:46:35.5057383Z * [new branch] gh/ruisizhang123/6/base -> origin/gh/ruisizhang123/6/base 2025-12-04T12:46:35.5057460Z * [new branch] gh/ruisizhang123/6/head -> origin/gh/ruisizhang123/6/head 2025-12-04T12:46:35.5057625Z * [new branch] gh/ruisizhang123/6/orig -> origin/gh/ruisizhang123/6/orig 2025-12-04T12:46:35.5057699Z * [new branch] gh/ruisizhang123/7/base -> origin/gh/ruisizhang123/7/base 2025-12-04T12:46:35.5057775Z * [new branch] gh/ruisizhang123/7/head -> origin/gh/ruisizhang123/7/head 2025-12-04T12:46:35.5057850Z * [new branch] gh/ruisizhang123/7/orig -> origin/gh/ruisizhang123/7/orig 2025-12-04T12:46:35.5057925Z * [new branch] gh/ruisizhang123/8/base -> origin/gh/ruisizhang123/8/base 2025-12-04T12:46:35.5058004Z * [new branch] gh/ruisizhang123/8/head -> origin/gh/ruisizhang123/8/head 2025-12-04T12:46:35.5058078Z * [new branch] 
gh/ruisizhang123/8/orig -> origin/gh/ruisizhang123/8/orig 2025-12-04T12:46:35.5058153Z * [new branch] gh/ruisizhang123/9/base -> origin/gh/ruisizhang123/9/base 2025-12-04T12:46:35.5058230Z * [new branch] gh/ruisizhang123/9/head -> origin/gh/ruisizhang123/9/head 2025-12-04T12:46:35.5058307Z * [new branch] gh/ruisizhang123/9/orig -> origin/gh/ruisizhang123/9/orig 2025-12-04T12:46:35.5058384Z * [new branch] gh/seemethere/52/base -> origin/gh/seemethere/52/base 2025-12-04T12:46:35.5058461Z * [new branch] gh/seemethere/52/head -> origin/gh/seemethere/52/head 2025-12-04T12:46:35.5058534Z * [new branch] gh/seemethere/52/orig -> origin/gh/seemethere/52/orig 2025-12-04T12:46:35.5058608Z * [new branch] gh/seemethere/53/base -> origin/gh/seemethere/53/base 2025-12-04T12:46:35.5058684Z * [new branch] gh/seemethere/53/head -> origin/gh/seemethere/53/head 2025-12-04T12:46:35.5058758Z * [new branch] gh/seemethere/53/orig -> origin/gh/seemethere/53/orig 2025-12-04T12:46:35.5058830Z * [new branch] gh/seemethere/54/base -> origin/gh/seemethere/54/base 2025-12-04T12:46:35.5058903Z * [new branch] gh/seemethere/54/head -> origin/gh/seemethere/54/head 2025-12-04T12:46:35.5058981Z * [new branch] gh/seemethere/54/orig -> origin/gh/seemethere/54/orig 2025-12-04T12:46:35.5059054Z * [new branch] gh/seemethere/55/base -> origin/gh/seemethere/55/base 2025-12-04T12:46:35.5059127Z * [new branch] gh/seemethere/55/head -> origin/gh/seemethere/55/head 2025-12-04T12:46:35.5059200Z * [new branch] gh/seemethere/55/orig -> origin/gh/seemethere/55/orig 2025-12-04T12:46:35.5059273Z * [new branch] gh/seemethere/59/base -> origin/gh/seemethere/59/base 2025-12-04T12:46:35.5059347Z * [new branch] gh/seemethere/59/head -> origin/gh/seemethere/59/head 2025-12-04T12:46:35.5059418Z * [new branch] gh/seemethere/59/orig -> origin/gh/seemethere/59/orig 2025-12-04T12:46:35.5059492Z * [new branch] gh/seemethere/62/base -> origin/gh/seemethere/62/base 2025-12-04T12:46:35.5059566Z * [new branch] gh/seemethere/62/head 
-> origin/gh/seemethere/62/head 2025-12-04T12:46:35.5059639Z * [new branch] gh/seemethere/62/orig -> origin/gh/seemethere/62/orig 2025-12-04T12:46:35.5059715Z * [new branch] gh/seemethere/63/base -> origin/gh/seemethere/63/base 2025-12-04T12:46:35.5059787Z * [new branch] gh/seemethere/63/head -> origin/gh/seemethere/63/head 2025-12-04T12:46:35.5059859Z * [new branch] gh/seemethere/63/orig -> origin/gh/seemethere/63/orig 2025-12-04T12:46:35.5059933Z * [new branch] gh/seemethere/71/base -> origin/gh/seemethere/71/base 2025-12-04T12:46:35.5060038Z * [new branch] gh/seemethere/71/head -> origin/gh/seemethere/71/head 2025-12-04T12:46:35.5060112Z * [new branch] gh/seemethere/71/orig -> origin/gh/seemethere/71/orig 2025-12-04T12:46:35.5060186Z * [new branch] gh/seemethere/72/base -> origin/gh/seemethere/72/base 2025-12-04T12:46:35.5060292Z * [new branch] gh/seemethere/72/head -> origin/gh/seemethere/72/head 2025-12-04T12:46:35.5060365Z * [new branch] gh/seemethere/72/orig -> origin/gh/seemethere/72/orig 2025-12-04T12:46:35.5060438Z * [new branch] gh/seemethere/73/base -> origin/gh/seemethere/73/base 2025-12-04T12:46:35.5060510Z * [new branch] gh/seemethere/73/head -> origin/gh/seemethere/73/head 2025-12-04T12:46:35.5060583Z * [new branch] gh/seemethere/73/orig -> origin/gh/seemethere/73/orig 2025-12-04T12:46:35.5060655Z * [new branch] gh/seemethere/74/base -> origin/gh/seemethere/74/base 2025-12-04T12:46:35.5060728Z * [new branch] gh/seemethere/74/head -> origin/gh/seemethere/74/head 2025-12-04T12:46:35.5060801Z * [new branch] gh/seemethere/74/orig -> origin/gh/seemethere/74/orig 2025-12-04T12:46:35.5060874Z * [new branch] gh/seemethere/75/base -> origin/gh/seemethere/75/base 2025-12-04T12:46:35.5060949Z * [new branch] gh/seemethere/75/head -> origin/gh/seemethere/75/head 2025-12-04T12:46:35.5061023Z * [new branch] gh/seemethere/75/orig -> origin/gh/seemethere/75/orig 2025-12-04T12:46:35.5061094Z * [new branch] gh/seemethere/76/base -> origin/gh/seemethere/76/base 
2025-12-04T12:46:35.5061168Z * [new branch] gh/seemethere/76/head -> origin/gh/seemethere/76/head 2025-12-04T12:46:35.5061241Z * [new branch] gh/seemethere/76/orig -> origin/gh/seemethere/76/orig 2025-12-04T12:46:35.5061318Z * [new branch] gh/shunting314/145/base -> origin/gh/shunting314/145/base 2025-12-04T12:46:35.5061393Z * [new branch] gh/shunting314/145/head -> origin/gh/shunting314/145/head 2025-12-04T12:46:35.5061469Z * [new branch] gh/shunting314/145/orig -> origin/gh/shunting314/145/orig 2025-12-04T12:46:35.5061543Z * [new branch] gh/shunting314/176/base -> origin/gh/shunting314/176/base 2025-12-04T12:46:35.5061619Z * [new branch] gh/shunting314/176/head -> origin/gh/shunting314/176/head 2025-12-04T12:46:35.5061693Z * [new branch] gh/shunting314/176/orig -> origin/gh/shunting314/176/orig 2025-12-04T12:46:35.5061767Z * [new branch] gh/shunting314/249/base -> origin/gh/shunting314/249/base 2025-12-04T12:46:35.5061841Z * [new branch] gh/shunting314/249/head -> origin/gh/shunting314/249/head 2025-12-04T12:46:35.5061917Z * [new branch] gh/shunting314/249/orig -> origin/gh/shunting314/249/orig 2025-12-04T12:46:35.5061993Z * [new branch] gh/shunting314/253/base -> origin/gh/shunting314/253/base 2025-12-04T12:46:35.5062068Z * [new branch] gh/shunting314/253/head -> origin/gh/shunting314/253/head 2025-12-04T12:46:35.5062140Z * [new branch] gh/shunting314/253/orig -> origin/gh/shunting314/253/orig 2025-12-04T12:46:35.5062215Z * [new branch] gh/shunting314/256/base -> origin/gh/shunting314/256/base 2025-12-04T12:46:35.5062291Z * [new branch] gh/shunting314/256/head -> origin/gh/shunting314/256/head 2025-12-04T12:46:35.5062365Z * [new branch] gh/shunting314/256/orig -> origin/gh/shunting314/256/orig 2025-12-04T12:46:35.5062439Z * [new branch] gh/shunting314/257/base -> origin/gh/shunting314/257/base 2025-12-04T12:46:35.5062512Z * [new branch] gh/shunting314/257/head -> origin/gh/shunting314/257/head 2025-12-04T12:46:35.5062586Z * [new branch] gh/shunting314/257/orig 
-> origin/gh/shunting314/257/orig 2025-12-04T12:46:35.5062682Z * [new branch] gh/shunting314/258/base -> origin/gh/shunting314/258/base 2025-12-04T12:46:35.5062757Z * [new branch] gh/shunting314/258/head -> origin/gh/shunting314/258/head 2025-12-04T12:46:35.5062830Z * [new branch] gh/shunting314/258/orig -> origin/gh/shunting314/258/orig 2025-12-04T12:46:35.5062928Z * [new branch] gh/shunting314/259/base -> origin/gh/shunting314/259/base 2025-12-04T12:46:35.5063003Z * [new branch] gh/shunting314/259/head -> origin/gh/shunting314/259/head 2025-12-04T12:46:35.5063076Z * [new branch] gh/shunting314/259/orig -> origin/gh/shunting314/259/orig 2025-12-04T12:46:35.5063150Z * [new branch] gh/shunting314/260/base -> origin/gh/shunting314/260/base 2025-12-04T12:46:35.5063224Z * [new branch] gh/shunting314/260/head -> origin/gh/shunting314/260/head 2025-12-04T12:46:35.5063299Z * [new branch] gh/shunting314/260/orig -> origin/gh/shunting314/260/orig 2025-12-04T12:46:35.5063372Z * [new branch] gh/shunting314/261/base -> origin/gh/shunting314/261/base 2025-12-04T12:46:35.5063446Z * [new branch] gh/shunting314/261/head -> origin/gh/shunting314/261/head 2025-12-04T12:46:35.5063520Z * [new branch] gh/shunting314/261/orig -> origin/gh/shunting314/261/orig 2025-12-04T12:46:35.5063597Z * [new branch] gh/shunting314/262/base -> origin/gh/shunting314/262/base 2025-12-04T12:46:35.5063671Z * [new branch] gh/shunting314/262/head -> origin/gh/shunting314/262/head 2025-12-04T12:46:35.5063745Z * [new branch] gh/shunting314/262/orig -> origin/gh/shunting314/262/orig 2025-12-04T12:46:35.5063821Z * [new branch] gh/shunting314/263/base -> origin/gh/shunting314/263/base 2025-12-04T12:46:35.5063894Z * [new branch] gh/shunting314/263/head -> origin/gh/shunting314/263/head 2025-12-04T12:46:35.5063969Z * [new branch] gh/shunting314/263/orig -> origin/gh/shunting314/263/orig 2025-12-04T12:46:35.5064047Z * [new branch] gh/shunting314/264/base -> origin/gh/shunting314/264/base 
2025-12-04T12:46:35.5064122Z * [new branch] gh/shunting314/264/head -> origin/gh/shunting314/264/head 2025-12-04T12:46:35.5064197Z * [new branch] gh/shunting314/264/orig -> origin/gh/shunting314/264/orig 2025-12-04T12:46:35.5064274Z * [new branch] gh/shunting314/265/base -> origin/gh/shunting314/265/base 2025-12-04T12:46:35.5064348Z * [new branch] gh/shunting314/265/head -> origin/gh/shunting314/265/head 2025-12-04T12:46:35.5064421Z * [new branch] gh/shunting314/265/orig -> origin/gh/shunting314/265/orig 2025-12-04T12:46:35.5064495Z * [new branch] gh/shunting314/266/base -> origin/gh/shunting314/266/base 2025-12-04T12:46:35.5064570Z * [new branch] gh/shunting314/266/head -> origin/gh/shunting314/266/head 2025-12-04T12:46:35.5064646Z * [new branch] gh/shunting314/266/orig -> origin/gh/shunting314/266/orig 2025-12-04T12:46:35.5064722Z * [new branch] gh/shunting314/267/base -> origin/gh/shunting314/267/base 2025-12-04T12:46:35.5064797Z * [new branch] gh/shunting314/267/head -> origin/gh/shunting314/267/head 2025-12-04T12:46:35.5064873Z * [new branch] gh/shunting314/267/orig -> origin/gh/shunting314/267/orig 2025-12-04T12:46:35.5064949Z * [new branch] gh/shunting314/268/base -> origin/gh/shunting314/268/base 2025-12-04T12:46:35.5065023Z * [new branch] gh/shunting314/268/head -> origin/gh/shunting314/268/head 2025-12-04T12:46:35.5065096Z * [new branch] gh/shunting314/268/orig -> origin/gh/shunting314/268/orig 2025-12-04T12:46:35.5065171Z * [new branch] gh/shunting314/269/base -> origin/gh/shunting314/269/base 2025-12-04T12:46:35.5065281Z * [new branch] gh/shunting314/269/head -> origin/gh/shunting314/269/head 2025-12-04T12:46:35.5065356Z * [new branch] gh/shunting314/269/orig -> origin/gh/shunting314/269/orig 2025-12-04T12:46:35.5065429Z * [new branch] gh/silverguo/1/base -> origin/gh/silverguo/1/base 2025-12-04T12:46:35.5065501Z * [new branch] gh/silverguo/1/head -> origin/gh/silverguo/1/head 2025-12-04T12:46:35.5065597Z * [new branch] gh/silverguo/2/base -> 
origin/gh/silverguo/2/base 2025-12-04T12:46:35.5065667Z * [new branch] gh/silverguo/2/head -> origin/gh/silverguo/2/head 2025-12-04T12:46:35.5065737Z * [new branch] gh/silverguo/3/base -> origin/gh/silverguo/3/base 2025-12-04T12:46:35.5065808Z * [new branch] gh/silverguo/3/head -> origin/gh/silverguo/3/head 2025-12-04T12:46:35.5065877Z * [new branch] gh/silverguo/4/base -> origin/gh/silverguo/4/base 2025-12-04T12:46:35.5065948Z * [new branch] gh/silverguo/4/head -> origin/gh/silverguo/4/head 2025-12-04T12:46:35.5066023Z * [new branch] gh/slayton58/39/base -> origin/gh/slayton58/39/base 2025-12-04T12:46:35.5066095Z * [new branch] gh/slayton58/39/head -> origin/gh/slayton58/39/head 2025-12-04T12:46:35.5066165Z * [new branch] gh/slayton58/39/orig -> origin/gh/slayton58/39/orig 2025-12-04T12:46:35.5066238Z * [new branch] gh/slayton58/42/base -> origin/gh/slayton58/42/base 2025-12-04T12:46:35.5066308Z * [new branch] gh/slayton58/42/head -> origin/gh/slayton58/42/head 2025-12-04T12:46:35.5066377Z * [new branch] gh/slayton58/42/orig -> origin/gh/slayton58/42/orig 2025-12-04T12:46:35.5066447Z * [new branch] gh/slayton58/43/base -> origin/gh/slayton58/43/base 2025-12-04T12:46:35.5066517Z * [new branch] gh/slayton58/43/head -> origin/gh/slayton58/43/head 2025-12-04T12:46:35.5066587Z * [new branch] gh/slayton58/43/orig -> origin/gh/slayton58/43/orig 2025-12-04T12:46:35.5066659Z * [new branch] gh/slayton58/44/base -> origin/gh/slayton58/44/base 2025-12-04T12:46:35.5066728Z * [new branch] gh/slayton58/44/head -> origin/gh/slayton58/44/head 2025-12-04T12:46:35.5066798Z * [new branch] gh/slayton58/44/orig -> origin/gh/slayton58/44/orig 2025-12-04T12:46:35.5066868Z * [new branch] gh/slayton58/45/base -> origin/gh/slayton58/45/base 2025-12-04T12:46:35.5066938Z * [new branch] gh/slayton58/45/head -> origin/gh/slayton58/45/head 2025-12-04T12:46:35.5067010Z * [new branch] gh/slayton58/45/orig -> origin/gh/slayton58/45/orig 2025-12-04T12:46:35.5067080Z * [new branch] 
gh/slayton58/46/base -> origin/gh/slayton58/46/base 2025-12-04T12:46:35.5067149Z * [new branch] gh/slayton58/46/head -> origin/gh/slayton58/46/head 2025-12-04T12:46:35.5067220Z * [new branch] gh/slayton58/46/orig -> origin/gh/slayton58/46/orig 2025-12-04T12:46:35.5067290Z * [new branch] gh/slayton58/6/base -> origin/gh/slayton58/6/base 2025-12-04T12:46:35.5067361Z * [new branch] gh/slayton58/6/head -> origin/gh/slayton58/6/head 2025-12-04T12:46:35.5067430Z * [new branch] gh/slayton58/7/base -> origin/gh/slayton58/7/base 2025-12-04T12:46:35.5067546Z * [new branch] gh/slayton58/7/head -> origin/gh/slayton58/7/head 2025-12-04T12:46:35.5067621Z * [new branch] gh/soulitzer/269/base -> origin/gh/soulitzer/269/base 2025-12-04T12:46:35.5067696Z * [new branch] gh/soulitzer/269/head -> origin/gh/soulitzer/269/head 2025-12-04T12:46:35.5067769Z * [new branch] gh/soulitzer/269/orig -> origin/gh/soulitzer/269/orig 2025-12-04T12:46:35.5067840Z * [new branch] gh/soulitzer/276/base -> origin/gh/soulitzer/276/base 2025-12-04T12:46:35.5067949Z * [new branch] gh/soulitzer/276/head -> origin/gh/soulitzer/276/head 2025-12-04T12:46:35.5068020Z * [new branch] gh/soulitzer/276/orig -> origin/gh/soulitzer/276/orig 2025-12-04T12:46:35.5068092Z * [new branch] gh/soulitzer/287/base -> origin/gh/soulitzer/287/base 2025-12-04T12:46:35.5068164Z * [new branch] gh/soulitzer/287/head -> origin/gh/soulitzer/287/head 2025-12-04T12:46:35.5068269Z * [new branch] gh/soulitzer/287/orig -> origin/gh/soulitzer/287/orig 2025-12-04T12:46:35.5068341Z * [new branch] gh/soulitzer/296/base -> origin/gh/soulitzer/296/base 2025-12-04T12:46:35.5068413Z * [new branch] gh/soulitzer/296/head -> origin/gh/soulitzer/296/head 2025-12-04T12:46:35.5068485Z * [new branch] gh/soulitzer/296/orig -> origin/gh/soulitzer/296/orig 2025-12-04T12:46:35.5068557Z * [new branch] gh/soulitzer/299/base -> origin/gh/soulitzer/299/base 2025-12-04T12:46:35.5068629Z * [new branch] gh/soulitzer/299/head -> origin/gh/soulitzer/299/head 
2025-12-04T12:46:35.5068701Z * [new branch] gh/soulitzer/299/orig -> origin/gh/soulitzer/299/orig 2025-12-04T12:46:35.5068773Z * [new branch] gh/soulitzer/300/base -> origin/gh/soulitzer/300/base 2025-12-04T12:46:35.5068846Z * [new branch] gh/soulitzer/300/head -> origin/gh/soulitzer/300/head 2025-12-04T12:46:35.5068916Z * [new branch] gh/soulitzer/300/orig -> origin/gh/soulitzer/300/orig 2025-12-04T12:46:35.5068988Z * [new branch] gh/soulitzer/301/base -> origin/gh/soulitzer/301/base 2025-12-04T12:46:35.5069059Z * [new branch] gh/soulitzer/301/head -> origin/gh/soulitzer/301/head 2025-12-04T12:46:35.5069131Z * [new branch] gh/soulitzer/301/orig -> origin/gh/soulitzer/301/orig 2025-12-04T12:46:35.5069204Z * [new branch] gh/soulitzer/313/base -> origin/gh/soulitzer/313/base 2025-12-04T12:46:35.5069277Z * [new branch] gh/soulitzer/313/head -> origin/gh/soulitzer/313/head 2025-12-04T12:46:35.5069350Z * [new branch] gh/soulitzer/313/orig -> origin/gh/soulitzer/313/orig 2025-12-04T12:46:35.5069421Z * [new branch] gh/soulitzer/319/base -> origin/gh/soulitzer/319/base 2025-12-04T12:46:35.5069494Z * [new branch] gh/soulitzer/319/head -> origin/gh/soulitzer/319/head 2025-12-04T12:46:35.5069566Z * [new branch] gh/soulitzer/319/orig -> origin/gh/soulitzer/319/orig 2025-12-04T12:46:35.5069637Z * [new branch] gh/soulitzer/320/base -> origin/gh/soulitzer/320/base 2025-12-04T12:46:35.5069708Z * [new branch] gh/soulitzer/320/head -> origin/gh/soulitzer/320/head 2025-12-04T12:46:35.5069780Z * [new branch] gh/soulitzer/320/orig -> origin/gh/soulitzer/320/orig 2025-12-04T12:46:35.5069852Z * [new branch] gh/soulitzer/336/base -> origin/gh/soulitzer/336/base 2025-12-04T12:46:35.5069925Z * [new branch] gh/soulitzer/336/head -> origin/gh/soulitzer/336/head 2025-12-04T12:46:35.5069997Z * [new branch] gh/soulitzer/336/orig -> origin/gh/soulitzer/336/orig 2025-12-04T12:46:35.5070069Z * [new branch] gh/soulitzer/347/base -> origin/gh/soulitzer/347/base 2025-12-04T12:46:35.5070141Z * [new 
branch] gh/soulitzer/347/head -> origin/gh/soulitzer/347/head 2025-12-04T12:46:35.5070213Z * [new branch] gh/soulitzer/347/orig -> origin/gh/soulitzer/347/orig 2025-12-04T12:46:35.5070284Z * [new branch] gh/soulitzer/349/base -> origin/gh/soulitzer/349/base 2025-12-04T12:46:35.5070355Z * [new branch] gh/soulitzer/349/head -> origin/gh/soulitzer/349/head 2025-12-04T12:46:35.5070427Z * [new branch] gh/soulitzer/349/orig -> origin/gh/soulitzer/349/orig 2025-12-04T12:46:35.5070527Z * [new branch] gh/soulitzer/350/base -> origin/gh/soulitzer/350/base 2025-12-04T12:46:35.5070600Z * [new branch] gh/soulitzer/350/head -> origin/gh/soulitzer/350/head 2025-12-04T12:46:35.5070671Z * [new branch] gh/soulitzer/350/orig -> origin/gh/soulitzer/350/orig 2025-12-04T12:46:35.5070742Z * [new branch] gh/soulitzer/351/base -> origin/gh/soulitzer/351/base 2025-12-04T12:46:35.5070837Z * [new branch] gh/soulitzer/351/head -> origin/gh/soulitzer/351/head 2025-12-04T12:46:35.5070908Z * [new branch] gh/soulitzer/351/orig -> origin/gh/soulitzer/351/orig 2025-12-04T12:46:35.5070980Z * [new branch] gh/soulitzer/353/base -> origin/gh/soulitzer/353/base 2025-12-04T12:46:35.5071052Z * [new branch] gh/soulitzer/353/head -> origin/gh/soulitzer/353/head 2025-12-04T12:46:35.5071124Z * [new branch] gh/soulitzer/353/orig -> origin/gh/soulitzer/353/orig 2025-12-04T12:46:35.5071196Z * [new branch] gh/soulitzer/358/base -> origin/gh/soulitzer/358/base 2025-12-04T12:46:35.5071268Z * [new branch] gh/soulitzer/358/head -> origin/gh/soulitzer/358/head 2025-12-04T12:46:35.5071339Z * [new branch] gh/soulitzer/358/orig -> origin/gh/soulitzer/358/orig 2025-12-04T12:46:35.5071411Z * [new branch] gh/soulitzer/359/base -> origin/gh/soulitzer/359/base 2025-12-04T12:46:35.5071484Z * [new branch] gh/soulitzer/359/head -> origin/gh/soulitzer/359/head 2025-12-04T12:46:35.5071556Z * [new branch] gh/soulitzer/359/orig -> origin/gh/soulitzer/359/orig 2025-12-04T12:46:35.5071626Z * [new branch] gh/soulitzer/374/base -> 
origin/gh/soulitzer/374/base 2025-12-04T12:46:35.5071698Z * [new branch] gh/soulitzer/374/head -> origin/gh/soulitzer/374/head 2025-12-04T12:46:35.5071769Z * [new branch] gh/soulitzer/374/orig -> origin/gh/soulitzer/374/orig 2025-12-04T12:46:35.5071842Z * [new branch] gh/soulitzer/375/base -> origin/gh/soulitzer/375/base 2025-12-04T12:46:35.5071913Z * [new branch] gh/soulitzer/375/head -> origin/gh/soulitzer/375/head 2025-12-04T12:46:35.5071984Z * [new branch] gh/soulitzer/375/orig -> origin/gh/soulitzer/375/orig 2025-12-04T12:46:35.5072055Z * [new branch] gh/soulitzer/380/base -> origin/gh/soulitzer/380/base 2025-12-04T12:46:35.5072129Z * [new branch] gh/soulitzer/380/head -> origin/gh/soulitzer/380/head 2025-12-04T12:46:35.5072200Z * [new branch] gh/soulitzer/380/orig -> origin/gh/soulitzer/380/orig 2025-12-04T12:46:35.5072273Z * [new branch] gh/soulitzer/385/base -> origin/gh/soulitzer/385/base 2025-12-04T12:46:35.5072344Z * [new branch] gh/soulitzer/385/head -> origin/gh/soulitzer/385/head 2025-12-04T12:46:35.5072415Z * [new branch] gh/soulitzer/385/orig -> origin/gh/soulitzer/385/orig 2025-12-04T12:46:35.5072488Z * [new branch] gh/soulitzer/386/base -> origin/gh/soulitzer/386/base 2025-12-04T12:46:35.5072560Z * [new branch] gh/soulitzer/386/head -> origin/gh/soulitzer/386/head 2025-12-04T12:46:35.5072631Z * [new branch] gh/soulitzer/386/orig -> origin/gh/soulitzer/386/orig 2025-12-04T12:46:35.5072705Z * [new branch] gh/soulitzer/387/base -> origin/gh/soulitzer/387/base 2025-12-04T12:46:35.5072778Z * [new branch] gh/soulitzer/387/head -> origin/gh/soulitzer/387/head 2025-12-04T12:46:35.5072850Z * [new branch] gh/soulitzer/387/orig -> origin/gh/soulitzer/387/orig 2025-12-04T12:46:35.5072923Z * [new branch] gh/soulitzer/388/base -> origin/gh/soulitzer/388/base 2025-12-04T12:46:35.5072995Z * [new branch] gh/soulitzer/388/head -> origin/gh/soulitzer/388/head 2025-12-04T12:46:35.5073066Z * [new branch] gh/soulitzer/388/orig -> origin/gh/soulitzer/388/orig 
2025-12-04T12:46:35.5073164Z * [new branch] gh/soulitzer/389/base -> origin/gh/soulitzer/389/base 2025-12-04T12:46:35.5073235Z * [new branch] gh/soulitzer/389/head -> origin/gh/soulitzer/389/head 2025-12-04T12:46:35.5073306Z * [new branch] gh/soulitzer/389/orig -> origin/gh/soulitzer/389/orig 2025-12-04T12:46:35.5073407Z * [new branch] gh/soulitzer/390/base -> origin/gh/soulitzer/390/base 2025-12-04T12:46:35.5073478Z * [new branch] gh/soulitzer/390/head -> origin/gh/soulitzer/390/head 2025-12-04T12:46:35.5073550Z * [new branch] gh/soulitzer/390/orig -> origin/gh/soulitzer/390/orig 2025-12-04T12:46:35.5073623Z * [new branch] gh/soulitzer/391/base -> origin/gh/soulitzer/391/base 2025-12-04T12:46:35.5073693Z * [new branch] gh/soulitzer/391/head -> origin/gh/soulitzer/391/head 2025-12-04T12:46:35.5073766Z * [new branch] gh/soulitzer/391/orig -> origin/gh/soulitzer/391/orig 2025-12-04T12:46:35.5073838Z * [new branch] gh/soulitzer/392/base -> origin/gh/soulitzer/392/base 2025-12-04T12:46:35.5073909Z * [new branch] gh/soulitzer/392/head -> origin/gh/soulitzer/392/head 2025-12-04T12:46:35.5073981Z * [new branch] gh/soulitzer/392/orig -> origin/gh/soulitzer/392/orig 2025-12-04T12:46:35.5074053Z * [new branch] gh/swolchok/728/next -> origin/gh/swolchok/728/next 2025-12-04T12:46:35.5074123Z * [new branch] gh/swolchok/819/base -> origin/gh/swolchok/819/base 2025-12-04T12:46:35.5074194Z * [new branch] gh/swolchok/819/head -> origin/gh/swolchok/819/head 2025-12-04T12:46:35.5074263Z * [new branch] gh/swolchok/819/orig -> origin/gh/swolchok/819/orig 2025-12-04T12:46:35.5074333Z * [new branch] gh/swolchok/824/base -> origin/gh/swolchok/824/base 2025-12-04T12:46:35.5074405Z * [new branch] gh/swolchok/824/head -> origin/gh/swolchok/824/head 2025-12-04T12:46:35.5074475Z * [new branch] gh/swolchok/824/orig -> origin/gh/swolchok/824/orig 2025-12-04T12:46:35.5074545Z * [new branch] gh/swolchok/829/base -> origin/gh/swolchok/829/base 2025-12-04T12:46:35.5074616Z * [new branch] 
gh/swolchok/829/head -> origin/gh/swolchok/829/head 2025-12-04T12:46:35.5074687Z * [new branch] gh/swolchok/829/orig -> origin/gh/swolchok/829/orig 2025-12-04T12:46:35.5074756Z * [new branch] gh/swolchok/839/base -> origin/gh/swolchok/839/base 2025-12-04T12:46:35.5074826Z * [new branch] gh/swolchok/839/head -> origin/gh/swolchok/839/head 2025-12-04T12:46:35.5074895Z * [new branch] gh/swolchok/839/orig -> origin/gh/swolchok/839/orig 2025-12-04T12:46:35.5074964Z * [new branch] gh/swolchok/841/base -> origin/gh/swolchok/841/base 2025-12-04T12:46:35.5075035Z * [new branch] gh/swolchok/841/head -> origin/gh/swolchok/841/head 2025-12-04T12:46:35.5075105Z * [new branch] gh/swolchok/841/orig -> origin/gh/swolchok/841/orig 2025-12-04T12:46:35.5075175Z * [new branch] gh/swolchok/842/base -> origin/gh/swolchok/842/base 2025-12-04T12:46:35.5075245Z * [new branch] gh/swolchok/842/head -> origin/gh/swolchok/842/head 2025-12-04T12:46:35.5075315Z * [new branch] gh/swolchok/842/orig -> origin/gh/swolchok/842/orig 2025-12-04T12:46:35.5075386Z * [new branch] gh/swolchok/845/base -> origin/gh/swolchok/845/base 2025-12-04T12:46:35.5075456Z * [new branch] gh/swolchok/845/head -> origin/gh/swolchok/845/head 2025-12-04T12:46:35.5075525Z * [new branch] gh/swolchok/845/orig -> origin/gh/swolchok/845/orig 2025-12-04T12:46:35.5075594Z * [new branch] gh/swolchok/848/base -> origin/gh/swolchok/848/base 2025-12-04T12:46:35.5075703Z * [new branch] gh/swolchok/848/head -> origin/gh/swolchok/848/head 2025-12-04T12:46:35.5075773Z * [new branch] gh/swolchok/848/orig -> origin/gh/swolchok/848/orig 2025-12-04T12:46:35.5075844Z * [new branch] gh/swolchok/856/base -> origin/gh/swolchok/856/base 2025-12-04T12:46:35.5075913Z * [new branch] gh/swolchok/856/head -> origin/gh/swolchok/856/head 2025-12-04T12:46:35.5076005Z * [new branch] gh/swolchok/856/orig -> origin/gh/swolchok/856/orig 2025-12-04T12:46:35.5076076Z * [new branch] gh/swolchok/860/base -> origin/gh/swolchok/860/base 
2025-12-04T12:46:35.5076145Z * [new branch] gh/swolchok/860/head -> origin/gh/swolchok/860/head 2025-12-04T12:46:35.5076214Z * [new branch] gh/swolchok/860/orig -> origin/gh/swolchok/860/orig 2025-12-04T12:46:35.5076284Z * [new branch] gh/swolchok/861/base -> origin/gh/swolchok/861/base 2025-12-04T12:46:35.5076355Z * [new branch] gh/swolchok/861/head -> origin/gh/swolchok/861/head 2025-12-04T12:46:35.5076425Z * [new branch] gh/swolchok/861/orig -> origin/gh/swolchok/861/orig 2025-12-04T12:46:35.5076495Z * [new branch] gh/swolchok/862/base -> origin/gh/swolchok/862/base 2025-12-04T12:46:35.5076564Z * [new branch] gh/swolchok/862/head -> origin/gh/swolchok/862/head 2025-12-04T12:46:35.5076635Z * [new branch] gh/swolchok/862/orig -> origin/gh/swolchok/862/orig 2025-12-04T12:46:35.5076705Z * [new branch] gh/swolchok/863/base -> origin/gh/swolchok/863/base 2025-12-04T12:46:35.5076774Z * [new branch] gh/swolchok/863/head -> origin/gh/swolchok/863/head 2025-12-04T12:46:35.5076845Z * [new branch] gh/swolchok/863/orig -> origin/gh/swolchok/863/orig 2025-12-04T12:46:35.5076914Z * [new branch] gh/swolchok/864/base -> origin/gh/swolchok/864/base 2025-12-04T12:46:35.5076985Z * [new branch] gh/swolchok/864/head -> origin/gh/swolchok/864/head 2025-12-04T12:46:35.5077055Z * [new branch] gh/swolchok/864/orig -> origin/gh/swolchok/864/orig 2025-12-04T12:46:35.5077125Z * [new branch] gh/swolchok/865/base -> origin/gh/swolchok/865/base 2025-12-04T12:46:35.5077195Z * [new branch] gh/swolchok/865/head -> origin/gh/swolchok/865/head 2025-12-04T12:46:35.5077266Z * [new branch] gh/swolchok/865/orig -> origin/gh/swolchok/865/orig 2025-12-04T12:46:35.5077336Z * [new branch] gh/swolchok/866/base -> origin/gh/swolchok/866/base 2025-12-04T12:46:35.5077405Z * [new branch] gh/swolchok/866/head -> origin/gh/swolchok/866/head 2025-12-04T12:46:35.5077535Z * [new branch] gh/swolchok/866/orig -> origin/gh/swolchok/866/orig 2025-12-04T12:46:35.5077607Z * [new branch] gh/swolchok/867/base -> 
origin/gh/swolchok/867/base 2025-12-04T12:46:35.5077678Z * [new branch] gh/swolchok/867/head -> origin/gh/swolchok/867/head 2025-12-04T12:46:35.5077748Z * [new branch] gh/swolchok/867/orig -> origin/gh/swolchok/867/orig 2025-12-04T12:46:35.5077817Z * [new branch] gh/swolchok/868/base -> origin/gh/swolchok/868/base 2025-12-04T12:46:35.5077885Z * [new branch] gh/swolchok/868/head -> origin/gh/swolchok/868/head 2025-12-04T12:46:35.5077958Z * [new branch] gh/swolchok/868/orig -> origin/gh/swolchok/868/orig 2025-12-04T12:46:35.5078027Z * [new branch] gh/swolchok/869/base -> origin/gh/swolchok/869/base 2025-12-04T12:46:35.5078096Z * [new branch] gh/swolchok/869/head -> origin/gh/swolchok/869/head 2025-12-04T12:46:35.5078166Z * [new branch] gh/swolchok/869/orig -> origin/gh/swolchok/869/orig 2025-12-04T12:46:35.5078235Z * [new branch] gh/swolchok/870/base -> origin/gh/swolchok/870/base 2025-12-04T12:46:35.5078349Z * [new branch] gh/swolchok/870/head -> origin/gh/swolchok/870/head 2025-12-04T12:46:35.5078420Z * [new branch] gh/swolchok/870/orig -> origin/gh/swolchok/870/orig 2025-12-04T12:46:35.5078489Z * [new branch] gh/swolchok/871/base -> origin/gh/swolchok/871/base 2025-12-04T12:46:35.5078597Z * [new branch] gh/swolchok/871/head -> origin/gh/swolchok/871/head 2025-12-04T12:46:35.5078668Z * [new branch] gh/swolchok/871/orig -> origin/gh/swolchok/871/orig 2025-12-04T12:46:35.5078739Z * [new branch] gh/teja-rao/4/base -> origin/gh/teja-rao/4/base 2025-12-04T12:46:35.5078809Z * [new branch] gh/teja-rao/4/head -> origin/gh/teja-rao/4/head 2025-12-04T12:46:35.5078877Z * [new branch] gh/teja-rao/4/orig -> origin/gh/teja-rao/4/orig 2025-12-04T12:46:35.5078947Z * [new branch] gh/tianyu-l/2/base -> origin/gh/tianyu-l/2/base 2025-12-04T12:46:35.5079017Z * [new branch] gh/tianyu-l/2/head -> origin/gh/tianyu-l/2/head 2025-12-04T12:46:35.5079085Z * [new branch] gh/tianyu-l/2/orig -> origin/gh/tianyu-l/2/orig 2025-12-04T12:46:35.5079153Z * [new branch] gh/tianyu-l/3/base -> 
origin/gh/tianyu-l/3/base 2025-12-04T12:46:35.5079223Z * [new branch] gh/tianyu-l/3/orig -> origin/gh/tianyu-l/3/orig 2025-12-04T12:46:35.5079290Z * [new branch] gh/tianyu-l/4/base -> origin/gh/tianyu-l/4/base 2025-12-04T12:46:35.5079357Z * [new branch] gh/tianyu-l/4/head -> origin/gh/tianyu-l/4/head 2025-12-04T12:46:35.5079426Z * [new branch] gh/tianyu-l/4/orig -> origin/gh/tianyu-l/4/orig 2025-12-04T12:46:35.5079515Z * [new branch] gh/tugsbayasgalan/10/base -> origin/gh/tugsbayasgalan/10/base 2025-12-04T12:46:35.5079599Z * [new branch] gh/tugsbayasgalan/10/head -> origin/gh/tugsbayasgalan/10/head 2025-12-04T12:46:35.5079685Z * [new branch] gh/tugsbayasgalan/10/orig -> origin/gh/tugsbayasgalan/10/orig 2025-12-04T12:46:35.5079767Z * [new branch] gh/tugsbayasgalan/13/base -> origin/gh/tugsbayasgalan/13/base 2025-12-04T12:46:35.5079848Z * [new branch] gh/tugsbayasgalan/13/head -> origin/gh/tugsbayasgalan/13/head 2025-12-04T12:46:35.5079932Z * [new branch] gh/tugsbayasgalan/13/orig -> origin/gh/tugsbayasgalan/13/orig 2025-12-04T12:46:35.5080014Z * [new branch] gh/tugsbayasgalan/17/base -> origin/gh/tugsbayasgalan/17/base 2025-12-04T12:46:35.5080096Z * [new branch] gh/tugsbayasgalan/17/head -> origin/gh/tugsbayasgalan/17/head 2025-12-04T12:46:35.5080178Z * [new branch] gh/tugsbayasgalan/17/orig -> origin/gh/tugsbayasgalan/17/orig 2025-12-04T12:46:35.5080262Z * [new branch] gh/tugsbayasgalan/2/base -> origin/gh/tugsbayasgalan/2/base 2025-12-04T12:46:35.5080344Z * [new branch] gh/tugsbayasgalan/2/head -> origin/gh/tugsbayasgalan/2/head 2025-12-04T12:46:35.5080424Z * [new branch] gh/tugsbayasgalan/2/orig -> origin/gh/tugsbayasgalan/2/orig 2025-12-04T12:46:35.5080507Z * [new branch] gh/tugsbayasgalan/28/base -> origin/gh/tugsbayasgalan/28/base 2025-12-04T12:46:35.5080590Z * [new branch] gh/tugsbayasgalan/28/head -> origin/gh/tugsbayasgalan/28/head 2025-12-04T12:46:35.5080673Z * [new branch] gh/tugsbayasgalan/28/orig -> origin/gh/tugsbayasgalan/28/orig 
2025-12-04T12:46:35.5080754Z * [new branch] gh/tugsbayasgalan/32/base -> origin/gh/tugsbayasgalan/32/base 2025-12-04T12:46:35.5080836Z * [new branch] gh/tugsbayasgalan/32/head -> origin/gh/tugsbayasgalan/32/head 2025-12-04T12:46:35.5080918Z * [new branch] gh/tugsbayasgalan/32/orig -> origin/gh/tugsbayasgalan/32/orig 2025-12-04T12:46:35.5080999Z * [new branch] gh/tugsbayasgalan/35/base -> origin/gh/tugsbayasgalan/35/base 2025-12-04T12:46:35.5081108Z * [new branch] gh/tugsbayasgalan/35/head -> origin/gh/tugsbayasgalan/35/head 2025-12-04T12:46:35.5081191Z * [new branch] gh/tugsbayasgalan/35/orig -> origin/gh/tugsbayasgalan/35/orig 2025-12-04T12:46:35.5081272Z * [new branch] gh/tugsbayasgalan/36/base -> origin/gh/tugsbayasgalan/36/base 2025-12-04T12:46:35.5081373Z * [new branch] gh/tugsbayasgalan/36/head -> origin/gh/tugsbayasgalan/36/head 2025-12-04T12:46:35.5081454Z * [new branch] gh/tugsbayasgalan/36/orig -> origin/gh/tugsbayasgalan/36/orig 2025-12-04T12:46:35.5081536Z * [new branch] gh/tugsbayasgalan/37/base -> origin/gh/tugsbayasgalan/37/base 2025-12-04T12:46:35.5081618Z * [new branch] gh/tugsbayasgalan/37/head -> origin/gh/tugsbayasgalan/37/head 2025-12-04T12:46:35.5081699Z * [new branch] gh/tugsbayasgalan/37/orig -> origin/gh/tugsbayasgalan/37/orig 2025-12-04T12:46:35.5081784Z * [new branch] gh/tugsbayasgalan/43/base -> origin/gh/tugsbayasgalan/43/base 2025-12-04T12:46:35.5081865Z * [new branch] gh/tugsbayasgalan/43/head -> origin/gh/tugsbayasgalan/43/head 2025-12-04T12:46:35.5081947Z * [new branch] gh/tugsbayasgalan/43/orig -> origin/gh/tugsbayasgalan/43/orig 2025-12-04T12:46:35.5082030Z * [new branch] gh/tugsbayasgalan/48/base -> origin/gh/tugsbayasgalan/48/base 2025-12-04T12:46:35.5082112Z * [new branch] gh/tugsbayasgalan/48/head -> origin/gh/tugsbayasgalan/48/head 2025-12-04T12:46:35.5082194Z * [new branch] gh/tugsbayasgalan/48/orig -> origin/gh/tugsbayasgalan/48/orig 2025-12-04T12:46:35.5082275Z * [new branch] gh/tugsbayasgalan/51/base -> 
origin/gh/tugsbayasgalan/51/base 2025-12-04T12:46:35.5082357Z * [new branch] gh/tugsbayasgalan/51/head -> origin/gh/tugsbayasgalan/51/head 2025-12-04T12:46:35.5082439Z * [new branch] gh/tugsbayasgalan/51/orig -> origin/gh/tugsbayasgalan/51/orig 2025-12-04T12:46:35.5082522Z * [new branch] gh/tugsbayasgalan/52/base -> origin/gh/tugsbayasgalan/52/base 2025-12-04T12:46:35.5082604Z * [new branch] gh/tugsbayasgalan/52/head -> origin/gh/tugsbayasgalan/52/head 2025-12-04T12:46:35.5082685Z * [new branch] gh/tugsbayasgalan/52/orig -> origin/gh/tugsbayasgalan/52/orig 2025-12-04T12:46:35.5082769Z * [new branch] gh/tugsbayasgalan/53/base -> origin/gh/tugsbayasgalan/53/base 2025-12-04T12:46:35.5082851Z * [new branch] gh/tugsbayasgalan/53/head -> origin/gh/tugsbayasgalan/53/head 2025-12-04T12:46:35.5082932Z * [new branch] gh/tugsbayasgalan/53/orig -> origin/gh/tugsbayasgalan/53/orig 2025-12-04T12:46:35.5083015Z * [new branch] gh/tugsbayasgalan/55/base -> origin/gh/tugsbayasgalan/55/base 2025-12-04T12:46:35.5083096Z * [new branch] gh/tugsbayasgalan/55/head -> origin/gh/tugsbayasgalan/55/head 2025-12-04T12:46:35.5083181Z * [new branch] gh/tugsbayasgalan/55/orig -> origin/gh/tugsbayasgalan/55/orig 2025-12-04T12:46:35.5083262Z * [new branch] gh/tugsbayasgalan/59/base -> origin/gh/tugsbayasgalan/59/base 2025-12-04T12:46:35.5083344Z * [new branch] gh/tugsbayasgalan/59/head -> origin/gh/tugsbayasgalan/59/head 2025-12-04T12:46:35.5083428Z * [new branch] gh/tugsbayasgalan/59/orig -> origin/gh/tugsbayasgalan/59/orig 2025-12-04T12:46:35.5083508Z * [new branch] gh/tugsbayasgalan/6/base -> origin/gh/tugsbayasgalan/6/base 2025-12-04T12:46:35.5083588Z * [new branch] gh/tugsbayasgalan/6/head -> origin/gh/tugsbayasgalan/6/head 2025-12-04T12:46:35.5083668Z * [new branch] gh/tugsbayasgalan/6/orig -> origin/gh/tugsbayasgalan/6/orig 2025-12-04T12:46:35.5083750Z * [new branch] gh/tugsbayasgalan/60/base -> origin/gh/tugsbayasgalan/60/base 2025-12-04T12:46:35.5083853Z * [new branch] 
gh/tugsbayasgalan/60/head -> origin/gh/tugsbayasgalan/60/head 2025-12-04T12:46:35.5083936Z * [new branch] gh/tugsbayasgalan/60/orig -> origin/gh/tugsbayasgalan/60/orig 2025-12-04T12:46:35.5084017Z * [new branch] gh/tugsbayasgalan/61/base -> origin/gh/tugsbayasgalan/61/base 2025-12-04T12:46:35.5084099Z * [new branch] gh/tugsbayasgalan/61/head -> origin/gh/tugsbayasgalan/61/head 2025-12-04T12:46:35.5084215Z * [new branch] gh/tugsbayasgalan/61/orig -> origin/gh/tugsbayasgalan/61/orig 2025-12-04T12:46:35.5084298Z * [new branch] gh/tugsbayasgalan/63/base -> origin/gh/tugsbayasgalan/63/base 2025-12-04T12:46:35.5084379Z * [new branch] gh/tugsbayasgalan/63/head -> origin/gh/tugsbayasgalan/63/head 2025-12-04T12:46:35.5084462Z * [new branch] gh/tugsbayasgalan/63/orig -> origin/gh/tugsbayasgalan/63/orig 2025-12-04T12:46:35.5084543Z * [new branch] gh/tugsbayasgalan/67/base -> origin/gh/tugsbayasgalan/67/base 2025-12-04T12:46:35.5084627Z * [new branch] gh/tugsbayasgalan/67/head -> origin/gh/tugsbayasgalan/67/head 2025-12-04T12:46:35.5084709Z * [new branch] gh/tugsbayasgalan/67/orig -> origin/gh/tugsbayasgalan/67/orig 2025-12-04T12:46:35.5084791Z * [new branch] gh/tugsbayasgalan/68/base -> origin/gh/tugsbayasgalan/68/base 2025-12-04T12:46:35.5084875Z * [new branch] gh/tugsbayasgalan/68/head -> origin/gh/tugsbayasgalan/68/head 2025-12-04T12:46:35.5084957Z * [new branch] gh/tugsbayasgalan/68/orig -> origin/gh/tugsbayasgalan/68/orig 2025-12-04T12:46:35.5085037Z * [new branch] gh/tugsbayasgalan/7/base -> origin/gh/tugsbayasgalan/7/base 2025-12-04T12:46:35.5085118Z * [new branch] gh/tugsbayasgalan/7/head -> origin/gh/tugsbayasgalan/7/head 2025-12-04T12:46:35.5085197Z * [new branch] gh/tugsbayasgalan/7/orig -> origin/gh/tugsbayasgalan/7/orig 2025-12-04T12:46:35.5085281Z * [new branch] gh/tugsbayasgalan/70/base -> origin/gh/tugsbayasgalan/70/base 2025-12-04T12:46:35.5085363Z * [new branch] gh/tugsbayasgalan/70/head -> origin/gh/tugsbayasgalan/70/head 2025-12-04T12:46:35.5085444Z * [new 
branch] gh/tugsbayasgalan/70/orig -> origin/gh/tugsbayasgalan/70/orig 2025-12-04T12:46:35.5085527Z * [new branch] gh/tugsbayasgalan/71/base -> origin/gh/tugsbayasgalan/71/base 2025-12-04T12:46:35.5085609Z * [new branch] gh/tugsbayasgalan/71/head -> origin/gh/tugsbayasgalan/71/head 2025-12-04T12:46:35.5085690Z * [new branch] gh/tugsbayasgalan/71/orig -> origin/gh/tugsbayasgalan/71/orig 2025-12-04T12:46:35.5085771Z * [new branch] gh/tugsbayasgalan/72/base -> origin/gh/tugsbayasgalan/72/base 2025-12-04T12:46:35.5085853Z * [new branch] gh/tugsbayasgalan/72/head -> origin/gh/tugsbayasgalan/72/head 2025-12-04T12:46:35.5085935Z * [new branch] gh/tugsbayasgalan/72/orig -> origin/gh/tugsbayasgalan/72/orig 2025-12-04T12:46:35.5086016Z * [new branch] gh/tugsbayasgalan/73/base -> origin/gh/tugsbayasgalan/73/base 2025-12-04T12:46:35.5086099Z * [new branch] gh/tugsbayasgalan/73/head -> origin/gh/tugsbayasgalan/73/head 2025-12-04T12:46:35.5086180Z * [new branch] gh/tugsbayasgalan/73/orig -> origin/gh/tugsbayasgalan/73/orig 2025-12-04T12:46:35.5086265Z * [new branch] gh/tugsbayasgalan/74/base -> origin/gh/tugsbayasgalan/74/base 2025-12-04T12:46:35.5086348Z * [new branch] gh/tugsbayasgalan/74/head -> origin/gh/tugsbayasgalan/74/head 2025-12-04T12:46:35.5086428Z * [new branch] gh/tugsbayasgalan/74/orig -> origin/gh/tugsbayasgalan/74/orig 2025-12-04T12:46:35.5086510Z * [new branch] gh/tugsbayasgalan/75/base -> origin/gh/tugsbayasgalan/75/base 2025-12-04T12:46:35.5086592Z * [new branch] gh/tugsbayasgalan/75/head -> origin/gh/tugsbayasgalan/75/head 2025-12-04T12:46:35.5086694Z * [new branch] gh/tugsbayasgalan/75/orig -> origin/gh/tugsbayasgalan/75/orig 2025-12-04T12:46:35.5086776Z * [new branch] gh/tugsbayasgalan/76/base -> origin/gh/tugsbayasgalan/76/base 2025-12-04T12:46:35.5086858Z * [new branch] gh/tugsbayasgalan/76/head -> origin/gh/tugsbayasgalan/76/head 2025-12-04T12:46:35.5086963Z * [new branch] gh/tugsbayasgalan/76/orig -> origin/gh/tugsbayasgalan/76/orig 
2025-12-04T12:46:35.5087045Z * [new branch] gh/tugsbayasgalan/77/base -> origin/gh/tugsbayasgalan/77/base 2025-12-04T12:46:35.5087127Z * [new branch] gh/tugsbayasgalan/77/head -> origin/gh/tugsbayasgalan/77/head 2025-12-04T12:46:35.5087208Z * [new branch] gh/tugsbayasgalan/77/orig -> origin/gh/tugsbayasgalan/77/orig 2025-12-04T12:46:35.5087290Z * [new branch] gh/tugsbayasgalan/78/base -> origin/gh/tugsbayasgalan/78/base 2025-12-04T12:46:35.5087373Z * [new branch] gh/tugsbayasgalan/78/head -> origin/gh/tugsbayasgalan/78/head 2025-12-04T12:46:35.5087455Z * [new branch] gh/tugsbayasgalan/78/orig -> origin/gh/tugsbayasgalan/78/orig 2025-12-04T12:46:35.5087583Z * [new branch] gh/tugsbayasgalan/79/base -> origin/gh/tugsbayasgalan/79/base 2025-12-04T12:46:35.5087666Z * [new branch] gh/tugsbayasgalan/79/head -> origin/gh/tugsbayasgalan/79/head 2025-12-04T12:46:35.5087750Z * [new branch] gh/tugsbayasgalan/79/orig -> origin/gh/tugsbayasgalan/79/orig 2025-12-04T12:46:35.5087830Z * [new branch] gh/tugsbayasgalan/8/base -> origin/gh/tugsbayasgalan/8/base 2025-12-04T12:46:35.5087910Z * [new branch] gh/tugsbayasgalan/8/head -> origin/gh/tugsbayasgalan/8/head 2025-12-04T12:46:35.5087990Z * [new branch] gh/tugsbayasgalan/8/orig -> origin/gh/tugsbayasgalan/8/orig 2025-12-04T12:46:35.5088072Z * [new branch] gh/tugsbayasgalan/80/base -> origin/gh/tugsbayasgalan/80/base 2025-12-04T12:46:35.5088155Z * [new branch] gh/tugsbayasgalan/80/head -> origin/gh/tugsbayasgalan/80/head 2025-12-04T12:46:35.5088237Z * [new branch] gh/tugsbayasgalan/80/orig -> origin/gh/tugsbayasgalan/80/orig 2025-12-04T12:46:35.5088319Z * [new branch] gh/tugsbayasgalan/81/base -> origin/gh/tugsbayasgalan/81/base 2025-12-04T12:46:35.5088403Z * [new branch] gh/tugsbayasgalan/81/head -> origin/gh/tugsbayasgalan/81/head 2025-12-04T12:46:35.5088485Z * [new branch] gh/tugsbayasgalan/81/orig -> origin/gh/tugsbayasgalan/81/orig 2025-12-04T12:46:35.5088566Z * [new branch] gh/tugsbayasgalan/82/base -> 
origin/gh/tugsbayasgalan/82/base 2025-12-04T12:46:35.5088648Z * [new branch] gh/tugsbayasgalan/82/head -> origin/gh/tugsbayasgalan/82/head 2025-12-04T12:46:35.5088730Z * [new branch] gh/tugsbayasgalan/82/orig -> origin/gh/tugsbayasgalan/82/orig 2025-12-04T12:46:35.5088812Z * [new branch] gh/tugsbayasgalan/83/base -> origin/gh/tugsbayasgalan/83/base 2025-12-04T12:46:35.5088893Z * [new branch] gh/tugsbayasgalan/83/head -> origin/gh/tugsbayasgalan/83/head 2025-12-04T12:46:35.5088976Z * [new branch] gh/tugsbayasgalan/83/orig -> origin/gh/tugsbayasgalan/83/orig 2025-12-04T12:46:35.5089059Z * [new branch] gh/tugsbayasgalan/84/base -> origin/gh/tugsbayasgalan/84/base 2025-12-04T12:46:35.5089140Z * [new branch] gh/tugsbayasgalan/84/head -> origin/gh/tugsbayasgalan/84/head 2025-12-04T12:46:35.5089222Z * [new branch] gh/tugsbayasgalan/84/orig -> origin/gh/tugsbayasgalan/84/orig 2025-12-04T12:46:35.5089303Z * [new branch] gh/tugsbayasgalan/85/base -> origin/gh/tugsbayasgalan/85/base 2025-12-04T12:46:35.5089386Z * [new branch] gh/tugsbayasgalan/85/head -> origin/gh/tugsbayasgalan/85/head 2025-12-04T12:46:35.5089505Z * [new branch] gh/tugsbayasgalan/85/orig -> origin/gh/tugsbayasgalan/85/orig 2025-12-04T12:46:35.5089587Z * [new branch] gh/tugsbayasgalan/86/base -> origin/gh/tugsbayasgalan/86/base 2025-12-04T12:46:35.5089669Z * [new branch] gh/tugsbayasgalan/86/head -> origin/gh/tugsbayasgalan/86/head 2025-12-04T12:46:35.5089751Z * [new branch] gh/tugsbayasgalan/86/orig -> origin/gh/tugsbayasgalan/86/orig 2025-12-04T12:46:35.5089866Z * [new branch] gh/tugsbayasgalan/87/base -> origin/gh/tugsbayasgalan/87/base 2025-12-04T12:46:35.5089949Z * [new branch] gh/tugsbayasgalan/87/head -> origin/gh/tugsbayasgalan/87/head 2025-12-04T12:46:35.5090030Z * [new branch] gh/tugsbayasgalan/87/orig -> origin/gh/tugsbayasgalan/87/orig 2025-12-04T12:46:35.5090111Z * [new branch] gh/tugsbayasgalan/88/base -> origin/gh/tugsbayasgalan/88/base 2025-12-04T12:46:35.5090193Z * [new branch] 
gh/tugsbayasgalan/88/head -> origin/gh/tugsbayasgalan/88/head 2025-12-04T12:46:35.5090276Z * [new branch] gh/tugsbayasgalan/88/orig -> origin/gh/tugsbayasgalan/88/orig 2025-12-04T12:46:35.5090358Z * [new branch] gh/tugsbayasgalan/89/base -> origin/gh/tugsbayasgalan/89/base 2025-12-04T12:46:35.5090440Z * [new branch] gh/tugsbayasgalan/89/head -> origin/gh/tugsbayasgalan/89/head 2025-12-04T12:46:35.5090523Z * [new branch] gh/tugsbayasgalan/89/orig -> origin/gh/tugsbayasgalan/89/orig 2025-12-04T12:46:35.5090604Z * [new branch] gh/tugsbayasgalan/9/base -> origin/gh/tugsbayasgalan/9/base 2025-12-04T12:46:35.5090685Z * [new branch] gh/tugsbayasgalan/9/head -> origin/gh/tugsbayasgalan/9/head 2025-12-04T12:46:35.5090764Z * [new branch] gh/tugsbayasgalan/9/orig -> origin/gh/tugsbayasgalan/9/orig 2025-12-04T12:46:35.5090847Z * [new branch] gh/tugsbayasgalan/90/base -> origin/gh/tugsbayasgalan/90/base 2025-12-04T12:46:35.5090930Z * [new branch] gh/tugsbayasgalan/90/head -> origin/gh/tugsbayasgalan/90/head 2025-12-04T12:46:35.5091012Z * [new branch] gh/tugsbayasgalan/90/orig -> origin/gh/tugsbayasgalan/90/orig 2025-12-04T12:46:35.5091095Z * [new branch] gh/tugsbayasgalan/91/base -> origin/gh/tugsbayasgalan/91/base 2025-12-04T12:46:35.5091176Z * [new branch] gh/tugsbayasgalan/91/head -> origin/gh/tugsbayasgalan/91/head 2025-12-04T12:46:35.5091258Z * [new branch] gh/tugsbayasgalan/91/orig -> origin/gh/tugsbayasgalan/91/orig 2025-12-04T12:46:35.5091340Z * [new branch] gh/tugsbayasgalan/92/base -> origin/gh/tugsbayasgalan/92/base 2025-12-04T12:46:35.5091421Z * [new branch] gh/tugsbayasgalan/92/head -> origin/gh/tugsbayasgalan/92/head 2025-12-04T12:46:35.5091502Z * [new branch] gh/tugsbayasgalan/92/orig -> origin/gh/tugsbayasgalan/92/orig 2025-12-04T12:46:35.5091584Z * [new branch] gh/tugsbayasgalan/93/base -> origin/gh/tugsbayasgalan/93/base 2025-12-04T12:46:35.5091666Z * [new branch] gh/tugsbayasgalan/93/head -> origin/gh/tugsbayasgalan/93/head 2025-12-04T12:46:35.5091748Z * [new 
branch] gh/tugsbayasgalan/93/orig -> origin/gh/tugsbayasgalan/93/orig 2025-12-04T12:46:35.5091817Z * [new branch] gh/v0i0/14/base -> origin/gh/v0i0/14/base 2025-12-04T12:46:35.5091885Z * [new branch] gh/v0i0/14/head -> origin/gh/v0i0/14/head 2025-12-04T12:46:35.5091949Z * [new branch] gh/v0i0/14/orig -> origin/gh/v0i0/14/orig 2025-12-04T12:46:35.5092014Z * [new branch] gh/v0i0/15/base -> origin/gh/v0i0/15/base 2025-12-04T12:46:35.5092076Z * [new branch] gh/v0i0/15/head -> origin/gh/v0i0/15/head 2025-12-04T12:46:35.5092138Z * [new branch] gh/v0i0/15/orig -> origin/gh/v0i0/15/orig 2025-12-04T12:46:35.5092200Z * [new branch] gh/v0i0/16/base -> origin/gh/v0i0/16/base 2025-12-04T12:46:35.5092288Z * [new branch] gh/v0i0/16/head -> origin/gh/v0i0/16/head 2025-12-04T12:46:35.5092349Z * [new branch] gh/v0i0/16/orig -> origin/gh/v0i0/16/orig 2025-12-04T12:46:35.5092412Z * [new branch] gh/v0i0/17/base -> origin/gh/v0i0/17/base 2025-12-04T12:46:35.5092495Z * [new branch] gh/v0i0/17/head -> origin/gh/v0i0/17/head 2025-12-04T12:46:35.5092558Z * [new branch] gh/v0i0/17/orig -> origin/gh/v0i0/17/orig 2025-12-04T12:46:35.5092619Z * [new branch] gh/v0i0/18/base -> origin/gh/v0i0/18/base 2025-12-04T12:46:35.5092680Z * [new branch] gh/v0i0/18/head -> origin/gh/v0i0/18/head 2025-12-04T12:46:35.5092743Z * [new branch] gh/v0i0/18/orig -> origin/gh/v0i0/18/orig 2025-12-04T12:46:35.5092805Z * [new branch] gh/v0i0/19/base -> origin/gh/v0i0/19/base 2025-12-04T12:46:35.5092868Z * [new branch] gh/v0i0/19/head -> origin/gh/v0i0/19/head 2025-12-04T12:46:35.5092930Z * [new branch] gh/v0i0/19/orig -> origin/gh/v0i0/19/orig 2025-12-04T12:46:35.5093010Z * [new branch] gh/vishal9-team/1/base -> origin/gh/vishal9-team/1/base 2025-12-04T12:46:35.5093087Z * [new branch] gh/vishal9-team/1/head -> origin/gh/vishal9-team/1/head 2025-12-04T12:46:35.5093163Z * [new branch] gh/vishal9-team/2/base -> origin/gh/vishal9-team/2/base 2025-12-04T12:46:35.5093236Z * [new branch] gh/vishal9-team/2/head -> 
origin/gh/vishal9-team/2/head 2025-12-04T12:46:35.5093310Z * [new branch] gh/vishal9-team/2/orig -> origin/gh/vishal9-team/2/orig 2025-12-04T12:46:35.5093384Z * [new branch] gh/vishal9-team/3/base -> origin/gh/vishal9-team/3/base 2025-12-04T12:46:35.5093456Z * [new branch] gh/vishal9-team/3/head -> origin/gh/vishal9-team/3/head 2025-12-04T12:46:35.5093530Z * [new branch] gh/vishal9-team/3/orig -> origin/gh/vishal9-team/3/orig 2025-12-04T12:46:35.5093603Z * [new branch] gh/vishal9-team/4/base -> origin/gh/vishal9-team/4/base 2025-12-04T12:46:35.5093676Z * [new branch] gh/vishal9-team/4/head -> origin/gh/vishal9-team/4/head 2025-12-04T12:46:35.5093749Z * [new branch] gh/vishal9-team/4/orig -> origin/gh/vishal9-team/4/orig 2025-12-04T12:46:35.5093816Z * [new branch] gh/vkuzo/1/next -> origin/gh/vkuzo/1/next 2025-12-04T12:46:35.5093880Z * [new branch] gh/vkuzo/2/next -> origin/gh/vkuzo/2/next 2025-12-04T12:46:35.5093944Z * [new branch] gh/vkuzo/3/next -> origin/gh/vkuzo/3/next 2025-12-04T12:46:35.5094018Z * [new branch] gh/wconstab/424/base -> origin/gh/wconstab/424/base 2025-12-04T12:46:35.5094091Z * [new branch] gh/wconstab/424/head -> origin/gh/wconstab/424/head 2025-12-04T12:46:35.5094163Z * [new branch] gh/wconstab/424/orig -> origin/gh/wconstab/424/orig 2025-12-04T12:46:35.5094233Z * [new branch] gh/wconstab/435/base -> origin/gh/wconstab/435/base 2025-12-04T12:46:35.5094302Z * [new branch] gh/wconstab/435/head -> origin/gh/wconstab/435/head 2025-12-04T12:46:35.5094374Z * [new branch] gh/wconstab/435/orig -> origin/gh/wconstab/435/orig 2025-12-04T12:46:35.5094444Z * [new branch] gh/wconstab/444/base -> origin/gh/wconstab/444/base 2025-12-04T12:46:35.5094514Z * [new branch] gh/wconstab/444/head -> origin/gh/wconstab/444/head 2025-12-04T12:46:35.5094584Z * [new branch] gh/wconstab/444/orig -> origin/gh/wconstab/444/orig 2025-12-04T12:46:35.5094653Z * [new branch] gh/wconstab/447/base -> origin/gh/wconstab/447/base 2025-12-04T12:46:35.5094724Z * [new branch] 
gh/wconstab/447/head -> origin/gh/wconstab/447/head 2025-12-04T12:46:35.5094820Z * [new branch] gh/wconstab/447/orig -> origin/gh/wconstab/447/orig 2025-12-04T12:46:35.5094890Z * [new branch] gh/wconstab/448/base -> origin/gh/wconstab/448/base 2025-12-04T12:46:35.5094960Z * [new branch] gh/wconstab/448/head -> origin/gh/wconstab/448/head 2025-12-04T12:46:35.5095066Z * [new branch] gh/wconstab/448/orig -> origin/gh/wconstab/448/orig 2025-12-04T12:46:35.5095136Z * [new branch] gh/wconstab/449/base -> origin/gh/wconstab/449/base 2025-12-04T12:46:35.5095206Z * [new branch] gh/wconstab/449/head -> origin/gh/wconstab/449/head 2025-12-04T12:46:35.5095276Z * [new branch] gh/wconstab/449/orig -> origin/gh/wconstab/449/orig 2025-12-04T12:46:35.5095345Z * [new branch] gh/wconstab/450/base -> origin/gh/wconstab/450/base 2025-12-04T12:46:35.5095414Z * [new branch] gh/wconstab/450/head -> origin/gh/wconstab/450/head 2025-12-04T12:46:35.5095486Z * [new branch] gh/wconstab/450/orig -> origin/gh/wconstab/450/orig 2025-12-04T12:46:35.5095555Z * [new branch] gh/wconstab/451/base -> origin/gh/wconstab/451/base 2025-12-04T12:46:35.5095624Z * [new branch] gh/wconstab/451/head -> origin/gh/wconstab/451/head 2025-12-04T12:46:35.5095697Z * [new branch] gh/wconstab/451/orig -> origin/gh/wconstab/451/orig 2025-12-04T12:46:35.5095767Z * [new branch] gh/wconstab/452/base -> origin/gh/wconstab/452/base 2025-12-04T12:46:35.5095837Z * [new branch] gh/wconstab/452/head -> origin/gh/wconstab/452/head 2025-12-04T12:46:35.5095906Z * [new branch] gh/wconstab/452/orig -> origin/gh/wconstab/452/orig 2025-12-04T12:46:35.5095975Z * [new branch] gh/wconstab/453/base -> origin/gh/wconstab/453/base 2025-12-04T12:46:35.5096046Z * [new branch] gh/wconstab/453/head -> origin/gh/wconstab/453/head 2025-12-04T12:46:35.5096116Z * [new branch] gh/wconstab/453/orig -> origin/gh/wconstab/453/orig 2025-12-04T12:46:35.5096186Z * [new branch] gh/wconstab/454/base -> origin/gh/wconstab/454/base 
2025-12-04T12:46:35.5096257Z * [new branch] gh/wconstab/454/head -> origin/gh/wconstab/454/head 2025-12-04T12:46:35.5096329Z * [new branch] gh/wconstab/454/orig -> origin/gh/wconstab/454/orig 2025-12-04T12:46:35.5096398Z * [new branch] gh/wconstab/455/base -> origin/gh/wconstab/455/base 2025-12-04T12:46:35.5096468Z * [new branch] gh/wconstab/455/head -> origin/gh/wconstab/455/head 2025-12-04T12:46:35.5096537Z * [new branch] gh/wconstab/455/orig -> origin/gh/wconstab/455/orig 2025-12-04T12:46:35.5096607Z * [new branch] gh/wconstab/456/base -> origin/gh/wconstab/456/base 2025-12-04T12:46:35.5096678Z * [new branch] gh/wconstab/456/head -> origin/gh/wconstab/456/head 2025-12-04T12:46:35.5096749Z * [new branch] gh/wconstab/456/orig -> origin/gh/wconstab/456/orig 2025-12-04T12:46:35.5096819Z * [new branch] gh/wconstab/457/base -> origin/gh/wconstab/457/base 2025-12-04T12:46:35.5096891Z * [new branch] gh/wconstab/457/head -> origin/gh/wconstab/457/head 2025-12-04T12:46:35.5096961Z * [new branch] gh/wconstab/457/orig -> origin/gh/wconstab/457/orig 2025-12-04T12:46:35.5097031Z * [new branch] gh/wconstab/458/base -> origin/gh/wconstab/458/base 2025-12-04T12:46:35.5097101Z * [new branch] gh/wconstab/458/head -> origin/gh/wconstab/458/head 2025-12-04T12:46:35.5097170Z * [new branch] gh/wconstab/458/orig -> origin/gh/wconstab/458/orig 2025-12-04T12:46:35.5097241Z * [new branch] gh/wconstab/459/base -> origin/gh/wconstab/459/base 2025-12-04T12:46:35.5097335Z * [new branch] gh/wconstab/459/head -> origin/gh/wconstab/459/head 2025-12-04T12:46:35.5097405Z * [new branch] gh/wconstab/459/orig -> origin/gh/wconstab/459/orig 2025-12-04T12:46:35.5097527Z * [new branch] gh/wconstab/460/base -> origin/gh/wconstab/460/base 2025-12-04T12:46:35.5097599Z * [new branch] gh/wconstab/460/head -> origin/gh/wconstab/460/head 2025-12-04T12:46:35.5097704Z * [new branch] gh/wconstab/460/orig -> origin/gh/wconstab/460/orig 2025-12-04T12:46:35.5097775Z * [new branch] gh/wconstab/461/base -> 
origin/gh/wconstab/461/base 2025-12-04T12:46:35.5097845Z * [new branch] gh/wconstab/461/head -> origin/gh/wconstab/461/head 2025-12-04T12:46:35.5097914Z * [new branch] gh/wconstab/461/orig -> origin/gh/wconstab/461/orig 2025-12-04T12:46:35.5097984Z * [new branch] gh/wconstab/462/base -> origin/gh/wconstab/462/base 2025-12-04T12:46:35.5098055Z * [new branch] gh/wconstab/462/head -> origin/gh/wconstab/462/head 2025-12-04T12:46:35.5098125Z * [new branch] gh/wconstab/462/orig -> origin/gh/wconstab/462/orig 2025-12-04T12:46:35.5098195Z * [new branch] gh/wconstab/463/base -> origin/gh/wconstab/463/base 2025-12-04T12:46:35.5098264Z * [new branch] gh/wconstab/463/head -> origin/gh/wconstab/463/head 2025-12-04T12:46:35.5098335Z * [new branch] gh/wconstab/463/orig -> origin/gh/wconstab/463/orig 2025-12-04T12:46:35.5098406Z * [new branch] gh/wconstab/464/base -> origin/gh/wconstab/464/base 2025-12-04T12:46:35.5098475Z * [new branch] gh/wconstab/464/head -> origin/gh/wconstab/464/head 2025-12-04T12:46:35.5098544Z * [new branch] gh/wconstab/464/orig -> origin/gh/wconstab/464/orig 2025-12-04T12:46:35.5098615Z * [new branch] gh/wconstab/465/base -> origin/gh/wconstab/465/base 2025-12-04T12:46:35.5098685Z * [new branch] gh/wconstab/465/head -> origin/gh/wconstab/465/head 2025-12-04T12:46:35.5098756Z * [new branch] gh/wconstab/465/orig -> origin/gh/wconstab/465/orig 2025-12-04T12:46:35.5098825Z * [new branch] gh/wconstab/466/base -> origin/gh/wconstab/466/base 2025-12-04T12:46:35.5098895Z * [new branch] gh/wconstab/466/head -> origin/gh/wconstab/466/head 2025-12-04T12:46:35.5098967Z * [new branch] gh/wconstab/466/orig -> origin/gh/wconstab/466/orig 2025-12-04T12:46:35.5099036Z * [new branch] gh/wconstab/467/base -> origin/gh/wconstab/467/base 2025-12-04T12:46:35.5099105Z * [new branch] gh/wconstab/467/head -> origin/gh/wconstab/467/head 2025-12-04T12:46:35.5099175Z * [new branch] gh/wconstab/467/orig -> origin/gh/wconstab/467/orig 2025-12-04T12:46:35.5099245Z * [new branch] 
gh/wconstab/468/base -> origin/gh/wconstab/468/base 2025-12-04T12:46:35.5099316Z * [new branch] gh/wconstab/468/head -> origin/gh/wconstab/468/head 2025-12-04T12:46:35.5099386Z * [new branch] gh/wconstab/468/orig -> origin/gh/wconstab/468/orig 2025-12-04T12:46:35.5099458Z * [new branch] gh/weifengpy/39/base -> origin/gh/weifengpy/39/base 2025-12-04T12:46:35.5099529Z * [new branch] gh/weifengpy/39/head -> origin/gh/weifengpy/39/head 2025-12-04T12:46:35.5099602Z * [new branch] gh/weifengpy/39/orig -> origin/gh/weifengpy/39/orig 2025-12-04T12:46:35.5099672Z * [new branch] gh/weifengpy/40/base -> origin/gh/weifengpy/40/base 2025-12-04T12:46:35.5099742Z * [new branch] gh/weifengpy/40/head -> origin/gh/weifengpy/40/head 2025-12-04T12:46:35.5099813Z * [new branch] gh/weifengpy/40/orig -> origin/gh/weifengpy/40/orig 2025-12-04T12:46:35.5099883Z * [new branch] gh/weifengpy/41/base -> origin/gh/weifengpy/41/base 2025-12-04T12:46:35.5099990Z * [new branch] gh/weifengpy/41/head -> origin/gh/weifengpy/41/head 2025-12-04T12:46:35.5100062Z * [new branch] gh/weifengpy/41/orig -> origin/gh/weifengpy/41/orig 2025-12-04T12:46:35.5100142Z * [new branch] gh/williamwen42/250/base -> origin/gh/williamwen42/250/base 2025-12-04T12:46:35.5100222Z * [new branch] gh/williamwen42/250/head -> origin/gh/williamwen42/250/head 2025-12-04T12:46:35.5100324Z * [new branch] gh/williamwen42/250/orig -> origin/gh/williamwen42/250/orig 2025-12-04T12:46:35.5100401Z * [new branch] gh/williamwen42/279/base -> origin/gh/williamwen42/279/base 2025-12-04T12:46:35.5100479Z * [new branch] gh/williamwen42/279/head -> origin/gh/williamwen42/279/head 2025-12-04T12:46:35.5100555Z * [new branch] gh/williamwen42/279/orig -> origin/gh/williamwen42/279/orig 2025-12-04T12:46:35.5100632Z * [new branch] gh/williamwen42/282/base -> origin/gh/williamwen42/282/base 2025-12-04T12:46:35.5100711Z * [new branch] gh/williamwen42/282/head -> origin/gh/williamwen42/282/head 2025-12-04T12:46:35.5100788Z * [new branch] 
gh/williamwen42/282/orig -> origin/gh/williamwen42/282/orig 2025-12-04T12:46:35.5100865Z * [new branch] gh/williamwen42/287/base -> origin/gh/williamwen42/287/base 2025-12-04T12:46:35.5100943Z * [new branch] gh/williamwen42/287/head -> origin/gh/williamwen42/287/head 2025-12-04T12:46:35.5101020Z * [new branch] gh/williamwen42/287/orig -> origin/gh/williamwen42/287/orig 2025-12-04T12:46:35.5101096Z * [new branch] gh/williamwen42/288/base -> origin/gh/williamwen42/288/base 2025-12-04T12:46:35.5101174Z * [new branch] gh/williamwen42/288/head -> origin/gh/williamwen42/288/head 2025-12-04T12:46:35.5101250Z * [new branch] gh/williamwen42/288/orig -> origin/gh/williamwen42/288/orig 2025-12-04T12:46:35.5101327Z * [new branch] gh/williamwen42/296/base -> origin/gh/williamwen42/296/base 2025-12-04T12:46:35.5101405Z * [new branch] gh/williamwen42/296/head -> origin/gh/williamwen42/296/head 2025-12-04T12:46:35.5101481Z * [new branch] gh/williamwen42/296/orig -> origin/gh/williamwen42/296/orig 2025-12-04T12:46:35.5101558Z * [new branch] gh/williamwen42/297/base -> origin/gh/williamwen42/297/base 2025-12-04T12:46:35.5101637Z * [new branch] gh/williamwen42/297/head -> origin/gh/williamwen42/297/head 2025-12-04T12:46:35.5101713Z * [new branch] gh/williamwen42/297/orig -> origin/gh/williamwen42/297/orig 2025-12-04T12:46:35.5101791Z * [new branch] gh/williamwen42/306/base -> origin/gh/williamwen42/306/base 2025-12-04T12:46:35.5101867Z * [new branch] gh/williamwen42/306/head -> origin/gh/williamwen42/306/head 2025-12-04T12:46:35.5101944Z * [new branch] gh/williamwen42/306/orig -> origin/gh/williamwen42/306/orig 2025-12-04T12:46:35.5102022Z * [new branch] gh/williamwen42/309/base -> origin/gh/williamwen42/309/base 2025-12-04T12:46:35.5102099Z * [new branch] gh/williamwen42/309/head -> origin/gh/williamwen42/309/head 2025-12-04T12:46:35.5102175Z * [new branch] gh/williamwen42/309/orig -> origin/gh/williamwen42/309/orig 2025-12-04T12:46:35.5102254Z * [new branch] 
gh/williamwen42/310/base -> origin/gh/williamwen42/310/base 2025-12-04T12:46:35.5102331Z * [new branch] gh/williamwen42/310/head -> origin/gh/williamwen42/310/head 2025-12-04T12:46:35.5102407Z * [new branch] gh/williamwen42/310/orig -> origin/gh/williamwen42/310/orig 2025-12-04T12:46:35.5102484Z * [new branch] gh/williamwen42/311/base -> origin/gh/williamwen42/311/base 2025-12-04T12:46:35.5102560Z * [new branch] gh/williamwen42/311/head -> origin/gh/williamwen42/311/head 2025-12-04T12:46:35.5102660Z * [new branch] gh/williamwen42/311/orig -> origin/gh/williamwen42/311/orig 2025-12-04T12:46:35.5102739Z * [new branch] gh/williamwen42/319/base -> origin/gh/williamwen42/319/base 2025-12-04T12:46:35.5102815Z * [new branch] gh/williamwen42/319/head -> origin/gh/williamwen42/319/head 2025-12-04T12:46:35.5102892Z * [new branch] gh/williamwen42/319/orig -> origin/gh/williamwen42/319/orig 2025-12-04T12:46:35.5102992Z * [new branch] gh/williamwen42/325/base -> origin/gh/williamwen42/325/base 2025-12-04T12:46:35.5103068Z * [new branch] gh/williamwen42/325/head -> origin/gh/williamwen42/325/head 2025-12-04T12:46:35.5103145Z * [new branch] gh/williamwen42/325/orig -> origin/gh/williamwen42/325/orig 2025-12-04T12:46:35.5103221Z * [new branch] gh/williamwen42/326/base -> origin/gh/williamwen42/326/base 2025-12-04T12:46:35.5103297Z * [new branch] gh/williamwen42/326/head -> origin/gh/williamwen42/326/head 2025-12-04T12:46:35.5103375Z * [new branch] gh/williamwen42/326/orig -> origin/gh/williamwen42/326/orig 2025-12-04T12:46:35.5103452Z * [new branch] gh/williamwen42/327/base -> origin/gh/williamwen42/327/base 2025-12-04T12:46:35.5103528Z * [new branch] gh/williamwen42/327/head -> origin/gh/williamwen42/327/head 2025-12-04T12:46:35.5103607Z * [new branch] gh/williamwen42/327/orig -> origin/gh/williamwen42/327/orig 2025-12-04T12:46:35.5103684Z * [new branch] gh/williamwen42/328/base -> origin/gh/williamwen42/328/base 2025-12-04T12:46:35.5103761Z * [new branch] 
gh/williamwen42/328/head -> origin/gh/williamwen42/328/head 2025-12-04T12:46:35.5103839Z * [new branch] gh/williamwen42/328/orig -> origin/gh/williamwen42/328/orig 2025-12-04T12:46:35.5103915Z * [new branch] gh/williamwen42/329/base -> origin/gh/williamwen42/329/base 2025-12-04T12:46:35.5103992Z * [new branch] gh/williamwen42/329/head -> origin/gh/williamwen42/329/head 2025-12-04T12:46:35.5104070Z * [new branch] gh/williamwen42/329/orig -> origin/gh/williamwen42/329/orig 2025-12-04T12:46:35.5104146Z * [new branch] gh/williamwen42/330/base -> origin/gh/williamwen42/330/base 2025-12-04T12:46:35.5104223Z * [new branch] gh/williamwen42/330/head -> origin/gh/williamwen42/330/head 2025-12-04T12:46:35.5104304Z * [new branch] gh/williamwen42/330/orig -> origin/gh/williamwen42/330/orig 2025-12-04T12:46:35.5104381Z * [new branch] gh/williamwen42/331/base -> origin/gh/williamwen42/331/base 2025-12-04T12:46:35.5104457Z * [new branch] gh/williamwen42/331/head -> origin/gh/williamwen42/331/head 2025-12-04T12:46:35.5104534Z * [new branch] gh/williamwen42/331/orig -> origin/gh/williamwen42/331/orig 2025-12-04T12:46:35.5104610Z * [new branch] gh/williamwen42/332/base -> origin/gh/williamwen42/332/base 2025-12-04T12:46:35.5104688Z * [new branch] gh/williamwen42/332/head -> origin/gh/williamwen42/332/head 2025-12-04T12:46:35.5104765Z * [new branch] gh/williamwen42/332/orig -> origin/gh/williamwen42/332/orig 2025-12-04T12:46:35.5104841Z * [new branch] gh/williamwen42/333/base -> origin/gh/williamwen42/333/base 2025-12-04T12:46:35.5104919Z * [new branch] gh/williamwen42/333/head -> origin/gh/williamwen42/333/head 2025-12-04T12:46:35.5104995Z * [new branch] gh/williamwen42/333/orig -> origin/gh/williamwen42/333/orig 2025-12-04T12:46:35.5105071Z * [new branch] gh/williamwen42/334/base -> origin/gh/williamwen42/334/base 2025-12-04T12:46:35.5105148Z * [new branch] gh/williamwen42/334/head -> origin/gh/williamwen42/334/head 2025-12-04T12:46:35.5105225Z * [new branch] 
gh/williamwen42/334/orig -> origin/gh/williamwen42/334/orig 2025-12-04T12:46:35.5105332Z * [new branch] gh/williamwen42/335/base -> origin/gh/williamwen42/335/base 2025-12-04T12:46:35.5105411Z * [new branch] gh/williamwen42/335/head -> origin/gh/williamwen42/335/head 2025-12-04T12:46:35.5105487Z * [new branch] gh/williamwen42/335/orig -> origin/gh/williamwen42/335/orig 2025-12-04T12:46:35.5105564Z * [new branch] gh/williamwen42/336/base -> origin/gh/williamwen42/336/base 2025-12-04T12:46:35.5105664Z * [new branch] gh/williamwen42/336/head -> origin/gh/williamwen42/336/head 2025-12-04T12:46:35.5105740Z * [new branch] gh/williamwen42/336/orig -> origin/gh/williamwen42/336/orig 2025-12-04T12:46:35.5105817Z * [new branch] gh/williamwen42/337/base -> origin/gh/williamwen42/337/base 2025-12-04T12:46:35.5105895Z * [new branch] gh/williamwen42/337/head -> origin/gh/williamwen42/337/head 2025-12-04T12:46:35.5105971Z * [new branch] gh/williamwen42/337/orig -> origin/gh/williamwen42/337/orig 2025-12-04T12:46:35.5106048Z * [new branch] gh/williamwen42/338/base -> origin/gh/williamwen42/338/base 2025-12-04T12:46:35.5106125Z * [new branch] gh/williamwen42/338/head -> origin/gh/williamwen42/338/head 2025-12-04T12:46:35.5106201Z * [new branch] gh/williamwen42/338/orig -> origin/gh/williamwen42/338/orig 2025-12-04T12:46:35.5106281Z * [new branch] gh/williamwen42/339/base -> origin/gh/williamwen42/339/base 2025-12-04T12:46:35.5106359Z * [new branch] gh/williamwen42/339/head -> origin/gh/williamwen42/339/head 2025-12-04T12:46:35.5106435Z * [new branch] gh/williamwen42/339/orig -> origin/gh/williamwen42/339/orig 2025-12-04T12:46:35.5106512Z * [new branch] gh/williamwen42/340/base -> origin/gh/williamwen42/340/base 2025-12-04T12:46:35.5106589Z * [new branch] gh/williamwen42/340/head -> origin/gh/williamwen42/340/head 2025-12-04T12:46:35.5106666Z * [new branch] gh/williamwen42/340/orig -> origin/gh/williamwen42/340/orig 2025-12-04T12:46:35.5106744Z * [new branch] 
gh/williamwen42/341/base -> origin/gh/williamwen42/341/base 2025-12-04T12:46:35.5106820Z * [new branch] gh/williamwen42/341/head -> origin/gh/williamwen42/341/head 2025-12-04T12:46:35.5106896Z * [new branch] gh/williamwen42/341/orig -> origin/gh/williamwen42/341/orig 2025-12-04T12:46:35.5106975Z * [new branch] gh/williamwen42/342/base -> origin/gh/williamwen42/342/base 2025-12-04T12:46:35.5107052Z * [new branch] gh/williamwen42/342/head -> origin/gh/williamwen42/342/head 2025-12-04T12:46:35.5107128Z * [new branch] gh/williamwen42/342/orig -> origin/gh/williamwen42/342/orig 2025-12-04T12:46:35.5107205Z * [new branch] gh/williamwen42/343/base -> origin/gh/williamwen42/343/base 2025-12-04T12:46:35.5107282Z * [new branch] gh/williamwen42/343/head -> origin/gh/williamwen42/343/head 2025-12-04T12:46:35.5107359Z * [new branch] gh/williamwen42/343/orig -> origin/gh/williamwen42/343/orig 2025-12-04T12:46:35.5107436Z * [new branch] gh/williamwen42/344/base -> origin/gh/williamwen42/344/base 2025-12-04T12:46:35.5107547Z * [new branch] gh/williamwen42/344/head -> origin/gh/williamwen42/344/head 2025-12-04T12:46:35.5107628Z * [new branch] gh/williamwen42/344/orig -> origin/gh/williamwen42/344/orig 2025-12-04T12:46:35.5107704Z * [new branch] gh/williamwen42/345/base -> origin/gh/williamwen42/345/base 2025-12-04T12:46:35.5107780Z * [new branch] gh/williamwen42/345/head -> origin/gh/williamwen42/345/head 2025-12-04T12:46:35.5107858Z * [new branch] gh/williamwen42/345/orig -> origin/gh/williamwen42/345/orig 2025-12-04T12:46:35.5107934Z * [new branch] gh/williamwen42/346/base -> origin/gh/williamwen42/346/base 2025-12-04T12:46:35.5108010Z * [new branch] gh/williamwen42/346/head -> origin/gh/williamwen42/346/head 2025-12-04T12:46:35.5108124Z * [new branch] gh/williamwen42/346/orig -> origin/gh/williamwen42/346/orig 2025-12-04T12:46:35.5108201Z * [new branch] gh/williamwen42/347/base -> origin/gh/williamwen42/347/base 2025-12-04T12:46:35.5108278Z * [new branch] 
gh/williamwen42/347/head -> origin/gh/williamwen42/347/head 2025-12-04T12:46:35.5108390Z * [new branch] gh/williamwen42/347/orig -> origin/gh/williamwen42/347/orig 2025-12-04T12:46:35.5108466Z * [new branch] gh/williamwen42/348/base -> origin/gh/williamwen42/348/base 2025-12-04T12:46:35.5108543Z * [new branch] gh/williamwen42/348/head -> origin/gh/williamwen42/348/head 2025-12-04T12:46:35.5108620Z * [new branch] gh/williamwen42/348/orig -> origin/gh/williamwen42/348/orig 2025-12-04T12:46:35.5108696Z * [new branch] gh/williamwen42/349/base -> origin/gh/williamwen42/349/base 2025-12-04T12:46:35.5108774Z * [new branch] gh/williamwen42/349/head -> origin/gh/williamwen42/349/head 2025-12-04T12:46:35.5108851Z * [new branch] gh/williamwen42/349/orig -> origin/gh/williamwen42/349/orig 2025-12-04T12:46:35.5108927Z * [new branch] gh/williamwen42/350/base -> origin/gh/williamwen42/350/base 2025-12-04T12:46:35.5109005Z * [new branch] gh/williamwen42/350/head -> origin/gh/williamwen42/350/head 2025-12-04T12:46:35.5109082Z * [new branch] gh/williamwen42/350/orig -> origin/gh/williamwen42/350/orig 2025-12-04T12:46:35.5109159Z * [new branch] gh/williamwen42/351/base -> origin/gh/williamwen42/351/base 2025-12-04T12:46:35.5109237Z * [new branch] gh/williamwen42/351/head -> origin/gh/williamwen42/351/head 2025-12-04T12:46:35.5109313Z * [new branch] gh/williamwen42/351/orig -> origin/gh/williamwen42/351/orig 2025-12-04T12:46:35.5109390Z * [new branch] gh/williamwen42/352/base -> origin/gh/williamwen42/352/base 2025-12-04T12:46:35.5109468Z * [new branch] gh/williamwen42/352/head -> origin/gh/williamwen42/352/head 2025-12-04T12:46:35.5109545Z * [new branch] gh/williamwen42/352/orig -> origin/gh/williamwen42/352/orig 2025-12-04T12:46:35.5109621Z * [new branch] gh/williamwen42/353/base -> origin/gh/williamwen42/353/base 2025-12-04T12:46:35.5109700Z * [new branch] gh/williamwen42/353/head -> origin/gh/williamwen42/353/head 2025-12-04T12:46:35.5109777Z * [new branch] 
gh/williamwen42/353/orig -> origin/gh/williamwen42/353/orig 2025-12-04T12:46:35.5109854Z * [new branch] gh/williamwen42/354/base -> origin/gh/williamwen42/354/base 2025-12-04T12:46:35.5109933Z * [new branch] gh/williamwen42/354/head -> origin/gh/williamwen42/354/head 2025-12-04T12:46:35.5110009Z * [new branch] gh/williamwen42/354/orig -> origin/gh/williamwen42/354/orig 2025-12-04T12:46:35.5110087Z * [new branch] gh/williamwen42/355/base -> origin/gh/williamwen42/355/base 2025-12-04T12:46:35.5110164Z * [new branch] gh/williamwen42/355/head -> origin/gh/williamwen42/355/head 2025-12-04T12:46:35.5110240Z * [new branch] gh/williamwen42/355/orig -> origin/gh/williamwen42/355/orig 2025-12-04T12:46:35.5110317Z * [new branch] gh/williamwen42/356/base -> origin/gh/williamwen42/356/base 2025-12-04T12:46:35.5110396Z * [new branch] gh/williamwen42/356/head -> origin/gh/williamwen42/356/head 2025-12-04T12:46:35.5110472Z * [new branch] gh/williamwen42/356/orig -> origin/gh/williamwen42/356/orig 2025-12-04T12:46:35.5110550Z * [new branch] gh/williamwen42/357/base -> origin/gh/williamwen42/357/base 2025-12-04T12:46:35.5110627Z * [new branch] gh/williamwen42/357/head -> origin/gh/williamwen42/357/head 2025-12-04T12:46:35.5110703Z * [new branch] gh/williamwen42/357/orig -> origin/gh/williamwen42/357/orig 2025-12-04T12:46:35.5110802Z * [new branch] gh/williamwen42/358/base -> origin/gh/williamwen42/358/base 2025-12-04T12:46:35.5110880Z * [new branch] gh/williamwen42/358/head -> origin/gh/williamwen42/358/head 2025-12-04T12:46:35.5110957Z * [new branch] gh/williamwen42/358/orig -> origin/gh/williamwen42/358/orig 2025-12-04T12:46:35.5111049Z * [new branch] gh/xmfan/169/base -> origin/gh/xmfan/169/base 2025-12-04T12:46:35.5111118Z * [new branch] gh/xmfan/169/head -> origin/gh/xmfan/169/head 2025-12-04T12:46:35.5111185Z * [new branch] gh/xmfan/170/base -> origin/gh/xmfan/170/base 2025-12-04T12:46:35.5111252Z * [new branch] gh/xmfan/170/head -> origin/gh/xmfan/170/head 
2025-12-04T12:46:35.5111318Z * [new branch] gh/xmfan/274/base -> origin/gh/xmfan/274/base 2025-12-04T12:46:35.5111384Z * [new branch] gh/xmfan/274/head -> origin/gh/xmfan/274/head 2025-12-04T12:46:35.5111451Z * [new branch] gh/xmfan/274/orig -> origin/gh/xmfan/274/orig 2025-12-04T12:46:35.5111516Z * [new branch] gh/xmfan/277/base -> origin/gh/xmfan/277/base 2025-12-04T12:46:35.5111581Z * [new branch] gh/xmfan/277/head -> origin/gh/xmfan/277/head 2025-12-04T12:46:35.5111650Z * [new branch] gh/xmfan/277/orig -> origin/gh/xmfan/277/orig 2025-12-04T12:46:35.5111715Z * [new branch] gh/xmfan/301/base -> origin/gh/xmfan/301/base 2025-12-04T12:46:35.5111780Z * [new branch] gh/xmfan/301/head -> origin/gh/xmfan/301/head 2025-12-04T12:46:35.5111847Z * [new branch] gh/xmfan/301/orig -> origin/gh/xmfan/301/orig 2025-12-04T12:46:35.5111912Z * [new branch] gh/xmfan/304/base -> origin/gh/xmfan/304/base 2025-12-04T12:46:35.5111977Z * [new branch] gh/xmfan/304/head -> origin/gh/xmfan/304/head 2025-12-04T12:46:35.5112047Z * [new branch] gh/xmfan/304/orig -> origin/gh/xmfan/304/orig 2025-12-04T12:46:35.5112113Z * [new branch] gh/xmfan/309/base -> origin/gh/xmfan/309/base 2025-12-04T12:46:35.5112179Z * [new branch] gh/xmfan/309/head -> origin/gh/xmfan/309/head 2025-12-04T12:46:35.5112246Z * [new branch] gh/xmfan/309/orig -> origin/gh/xmfan/309/orig 2025-12-04T12:46:35.5112311Z * [new branch] gh/xmfan/310/base -> origin/gh/xmfan/310/base 2025-12-04T12:46:35.5112378Z * [new branch] gh/xmfan/310/head -> origin/gh/xmfan/310/head 2025-12-04T12:46:35.5112443Z * [new branch] gh/xmfan/310/orig -> origin/gh/xmfan/310/orig 2025-12-04T12:46:35.5112508Z * [new branch] gh/xmfan/311/base -> origin/gh/xmfan/311/base 2025-12-04T12:46:35.5112575Z * [new branch] gh/xmfan/311/head -> origin/gh/xmfan/311/head 2025-12-04T12:46:35.5112642Z * [new branch] gh/xmfan/311/orig -> origin/gh/xmfan/311/orig 2025-12-04T12:46:35.5112707Z * [new branch] gh/xmfan/312/base -> origin/gh/xmfan/312/base 
2025-12-04T12:46:35.5112773Z * [new branch] gh/xmfan/312/head -> origin/gh/xmfan/312/head 2025-12-04T12:46:35.5112840Z * [new branch] gh/xmfan/312/orig -> origin/gh/xmfan/312/orig 2025-12-04T12:46:35.5112905Z * [new branch] gh/xmfan/313/base -> origin/gh/xmfan/313/base 2025-12-04T12:46:35.5112971Z * [new branch] gh/xmfan/313/head -> origin/gh/xmfan/313/head 2025-12-04T12:46:35.5113037Z * [new branch] gh/xmfan/313/orig -> origin/gh/xmfan/313/orig 2025-12-04T12:46:35.5113114Z * [new branch] gh/xuanzhang816/27/base -> origin/gh/xuanzhang816/27/base 2025-12-04T12:46:35.5113193Z * [new branch] gh/xuanzhang816/27/head -> origin/gh/xuanzhang816/27/head 2025-12-04T12:46:35.5113295Z * [new branch] gh/xuanzhang816/27/orig -> origin/gh/xuanzhang816/27/orig 2025-12-04T12:46:35.5113371Z * [new branch] gh/xuanzhang816/32/base -> origin/gh/xuanzhang816/32/base 2025-12-04T12:46:35.5113447Z * [new branch] gh/xuanzhang816/32/head -> origin/gh/xuanzhang816/32/head 2025-12-04T12:46:35.5113543Z * [new branch] gh/xuanzhang816/32/orig -> origin/gh/xuanzhang816/32/orig 2025-12-04T12:46:35.5113617Z * [new branch] gh/xuanzhang816/33/base -> origin/gh/xuanzhang816/33/base 2025-12-04T12:46:35.5113693Z * [new branch] gh/xuanzhang816/33/head -> origin/gh/xuanzhang816/33/head 2025-12-04T12:46:35.5113767Z * [new branch] gh/xuanzhang816/33/orig -> origin/gh/xuanzhang816/33/orig 2025-12-04T12:46:35.5113842Z * [new branch] gh/xuanzhang816/34/base -> origin/gh/xuanzhang816/34/base 2025-12-04T12:46:35.5113915Z * [new branch] gh/xuanzhang816/34/head -> origin/gh/xuanzhang816/34/head 2025-12-04T12:46:35.5113991Z * [new branch] gh/xuanzhang816/34/orig -> origin/gh/xuanzhang816/34/orig 2025-12-04T12:46:35.5114066Z * [new branch] gh/xuanzhang816/35/base -> origin/gh/xuanzhang816/35/base 2025-12-04T12:46:35.5114141Z * [new branch] gh/xuanzhang816/35/head -> origin/gh/xuanzhang816/35/head 2025-12-04T12:46:35.5114217Z * [new branch] gh/xuanzhang816/35/orig -> origin/gh/xuanzhang816/35/orig 
2025-12-04T12:46:35.5114291Z * [new branch] gh/yanbing-j/11/base -> origin/gh/yanbing-j/11/base 2025-12-04T12:46:35.5114361Z * [new branch] gh/yanbing-j/11/head -> origin/gh/yanbing-j/11/head 2025-12-04T12:46:35.5114432Z * [new branch] gh/yanbing-j/11/orig -> origin/gh/yanbing-j/11/orig 2025-12-04T12:46:35.5114503Z * [new branch] gh/yanbing-j/12/base -> origin/gh/yanbing-j/12/base 2025-12-04T12:46:35.5114574Z * [new branch] gh/yanbing-j/12/head -> origin/gh/yanbing-j/12/head 2025-12-04T12:46:35.5114644Z * [new branch] gh/yanbing-j/12/orig -> origin/gh/yanbing-j/12/orig 2025-12-04T12:46:35.5114715Z * [new branch] gh/yanbing-j/13/base -> origin/gh/yanbing-j/13/base 2025-12-04T12:46:35.5114783Z * [new branch] gh/yanbing-j/13/head -> origin/gh/yanbing-j/13/head 2025-12-04T12:46:35.5114853Z * [new branch] gh/yanbing-j/13/orig -> origin/gh/yanbing-j/13/orig 2025-12-04T12:46:35.5114923Z * [new branch] gh/yanbing-j/14/base -> origin/gh/yanbing-j/14/base 2025-12-04T12:46:35.5114991Z * [new branch] gh/yanbing-j/14/head -> origin/gh/yanbing-j/14/head 2025-12-04T12:46:35.5115060Z * [new branch] gh/yanbing-j/14/orig -> origin/gh/yanbing-j/14/orig 2025-12-04T12:46:35.5115130Z * [new branch] gh/yanbing-j/15/base -> origin/gh/yanbing-j/15/base 2025-12-04T12:46:35.5115200Z * [new branch] gh/yanbing-j/15/head -> origin/gh/yanbing-j/15/head 2025-12-04T12:46:35.5115270Z * [new branch] gh/yanbing-j/15/orig -> origin/gh/yanbing-j/15/orig 2025-12-04T12:46:35.5115339Z * [new branch] gh/yanbing-j/18/base -> origin/gh/yanbing-j/18/base 2025-12-04T12:46:35.5115408Z * [new branch] gh/yanbing-j/18/head -> origin/gh/yanbing-j/18/head 2025-12-04T12:46:35.5115480Z * [new branch] gh/yanbing-j/18/orig -> origin/gh/yanbing-j/18/orig 2025-12-04T12:46:35.5115549Z * [new branch] gh/yanbing-j/19/base -> origin/gh/yanbing-j/19/base 2025-12-04T12:46:35.5115619Z * [new branch] gh/yanbing-j/19/head -> origin/gh/yanbing-j/19/head 2025-12-04T12:46:35.5115688Z * [new branch] gh/yanbing-j/19/orig -> 
origin/gh/yanbing-j/19/orig 2025-12-04T12:46:35.5115757Z * [new branch] gh/yanbing-j/20/base -> origin/gh/yanbing-j/20/base 2025-12-04T12:46:35.5115859Z * [new branch] gh/yanbing-j/20/head -> origin/gh/yanbing-j/20/head 2025-12-04T12:46:35.5115929Z * [new branch] gh/yanbing-j/20/orig -> origin/gh/yanbing-j/20/orig 2025-12-04T12:46:35.5115998Z * [new branch] gh/yanbing-j/21/base -> origin/gh/yanbing-j/21/base 2025-12-04T12:46:35.5116067Z * [new branch] gh/yanbing-j/21/head -> origin/gh/yanbing-j/21/head 2025-12-04T12:46:35.5116188Z * [new branch] gh/yanbing-j/22/base -> origin/gh/yanbing-j/22/base 2025-12-04T12:46:35.5116256Z * [new branch] gh/yanbing-j/22/head -> origin/gh/yanbing-j/22/head 2025-12-04T12:46:35.5116325Z * [new branch] gh/yanbing-j/22/orig -> origin/gh/yanbing-j/22/orig 2025-12-04T12:46:35.5116395Z * [new branch] gh/yanbing-j/23/base -> origin/gh/yanbing-j/23/base 2025-12-04T12:46:35.5116463Z * [new branch] gh/yanbing-j/23/head -> origin/gh/yanbing-j/23/head 2025-12-04T12:46:35.5116533Z * [new branch] gh/yanbing-j/23/orig -> origin/gh/yanbing-j/23/orig 2025-12-04T12:46:35.5116603Z * [new branch] gh/yanbing-j/24/base -> origin/gh/yanbing-j/24/base 2025-12-04T12:46:35.5116672Z * [new branch] gh/yanbing-j/24/head -> origin/gh/yanbing-j/24/head 2025-12-04T12:46:35.5116742Z * [new branch] gh/yanbing-j/24/orig -> origin/gh/yanbing-j/24/orig 2025-12-04T12:46:35.5116813Z * [new branch] gh/yanbing-j/25/base -> origin/gh/yanbing-j/25/base 2025-12-04T12:46:35.5116882Z * [new branch] gh/yanbing-j/25/head -> origin/gh/yanbing-j/25/head 2025-12-04T12:46:35.5116952Z * [new branch] gh/yanbing-j/25/orig -> origin/gh/yanbing-j/25/orig 2025-12-04T12:46:35.5117021Z * [new branch] gh/yanbing-j/26/base -> origin/gh/yanbing-j/26/base 2025-12-04T12:46:35.5117089Z * [new branch] gh/yanbing-j/26/head -> origin/gh/yanbing-j/26/head 2025-12-04T12:46:35.5117161Z * [new branch] gh/yanbing-j/26/orig -> origin/gh/yanbing-j/26/orig 2025-12-04T12:46:35.5117241Z * [new branch] 
gh/yang-yu-hang/1/base -> origin/gh/yang-yu-hang/1/base 2025-12-04T12:46:35.5117316Z * [new branch] gh/yang-yu-hang/1/head -> origin/gh/yang-yu-hang/1/head 2025-12-04T12:46:35.5117391Z * [new branch] gh/yang-yu-hang/1/orig -> origin/gh/yang-yu-hang/1/orig 2025-12-04T12:46:35.5117464Z * [new branch] gh/yang-yu-hang/2/base -> origin/gh/yang-yu-hang/2/base 2025-12-04T12:46:35.5117566Z * [new branch] gh/yang-yu-hang/2/head -> origin/gh/yang-yu-hang/2/head 2025-12-04T12:46:35.5117640Z * [new branch] gh/yang-yu-hang/2/orig -> origin/gh/yang-yu-hang/2/orig 2025-12-04T12:46:35.5117712Z * [new branch] gh/yang-yu-hang/3/base -> origin/gh/yang-yu-hang/3/base 2025-12-04T12:46:35.5117784Z * [new branch] gh/yang-yu-hang/3/head -> origin/gh/yang-yu-hang/3/head 2025-12-04T12:46:35.5117858Z * [new branch] gh/yang-yu-hang/3/orig -> origin/gh/yang-yu-hang/3/orig 2025-12-04T12:46:35.5117931Z * [new branch] gh/yangw-dev/12/base -> origin/gh/yangw-dev/12/base 2025-12-04T12:46:35.5118002Z * [new branch] gh/yangw-dev/12/head -> origin/gh/yangw-dev/12/head 2025-12-04T12:46:35.5118075Z * [new branch] gh/yangw-dev/12/orig -> origin/gh/yangw-dev/12/orig 2025-12-04T12:46:35.5118144Z * [new branch] gh/yangw-dev/13/base -> origin/gh/yangw-dev/13/base 2025-12-04T12:46:35.5118214Z * [new branch] gh/yangw-dev/13/head -> origin/gh/yangw-dev/13/head 2025-12-04T12:46:35.5118284Z * [new branch] gh/yangw-dev/13/orig -> origin/gh/yangw-dev/13/orig 2025-12-04T12:46:35.5118354Z * [new branch] gh/yangw-dev/14/base -> origin/gh/yangw-dev/14/base 2025-12-04T12:46:35.5118424Z * [new branch] gh/yangw-dev/14/head -> origin/gh/yangw-dev/14/head 2025-12-04T12:46:35.5118550Z * [new branch] gh/yangw-dev/14/orig -> origin/gh/yangw-dev/14/orig 2025-12-04T12:46:35.5118619Z * [new branch] gh/yangw-dev/15/base -> origin/gh/yangw-dev/15/base 2025-12-04T12:46:35.5118690Z * [new branch] gh/yangw-dev/15/head -> origin/gh/yangw-dev/15/head 2025-12-04T12:46:35.5118794Z * [new branch] gh/yangw-dev/15/orig -> 
origin/gh/yangw-dev/15/orig 2025-12-04T12:46:35.5118863Z * [new branch] gh/yangw-dev/19/base -> origin/gh/yangw-dev/19/base 2025-12-04T12:46:35.5118933Z * [new branch] gh/yangw-dev/19/head -> origin/gh/yangw-dev/19/head 2025-12-04T12:46:35.5119002Z * [new branch] gh/yangw-dev/19/orig -> origin/gh/yangw-dev/19/orig 2025-12-04T12:46:35.5119071Z * [new branch] gh/yangw-dev/26/base -> origin/gh/yangw-dev/26/base 2025-12-04T12:46:35.5119141Z * [new branch] gh/yangw-dev/26/head -> origin/gh/yangw-dev/26/head 2025-12-04T12:46:35.5119214Z * [new branch] gh/yangw-dev/26/orig -> origin/gh/yangw-dev/26/orig 2025-12-04T12:46:35.5119283Z * [new branch] gh/yangw-dev/27/base -> origin/gh/yangw-dev/27/base 2025-12-04T12:46:35.5119354Z * [new branch] gh/yangw-dev/27/head -> origin/gh/yangw-dev/27/head 2025-12-04T12:46:35.5119424Z * [new branch] gh/yangw-dev/27/orig -> origin/gh/yangw-dev/27/orig 2025-12-04T12:46:35.5119491Z * [new branch] gh/ydwu4/292/base -> origin/gh/ydwu4/292/base 2025-12-04T12:46:35.5119559Z * [new branch] gh/ydwu4/292/head -> origin/gh/ydwu4/292/head 2025-12-04T12:46:35.5119625Z * [new branch] gh/ydwu4/292/orig -> origin/gh/ydwu4/292/orig 2025-12-04T12:46:35.5119691Z * [new branch] gh/ydwu4/294/base -> origin/gh/ydwu4/294/base 2025-12-04T12:46:35.5119757Z * [new branch] gh/ydwu4/294/head -> origin/gh/ydwu4/294/head 2025-12-04T12:46:35.5119823Z * [new branch] gh/ydwu4/294/orig -> origin/gh/ydwu4/294/orig 2025-12-04T12:46:35.5119889Z * [new branch] gh/ydwu4/295/base -> origin/gh/ydwu4/295/base 2025-12-04T12:46:35.5119954Z * [new branch] gh/ydwu4/295/head -> origin/gh/ydwu4/295/head 2025-12-04T12:46:35.5120020Z * [new branch] gh/ydwu4/295/orig -> origin/gh/ydwu4/295/orig 2025-12-04T12:46:35.5120086Z * [new branch] gh/ydwu4/296/base -> origin/gh/ydwu4/296/base 2025-12-04T12:46:35.5120151Z * [new branch] gh/ydwu4/296/head -> origin/gh/ydwu4/296/head 2025-12-04T12:46:35.5120215Z * [new branch] gh/ydwu4/296/orig -> origin/gh/ydwu4/296/orig 
2025-12-04T12:46:35.5120281Z * [new branch] gh/ydwu4/306/base -> origin/gh/ydwu4/306/base 2025-12-04T12:46:35.5120346Z * [new branch] gh/ydwu4/306/head -> origin/gh/ydwu4/306/head 2025-12-04T12:46:35.5120411Z * [new branch] gh/ydwu4/306/orig -> origin/gh/ydwu4/306/orig 2025-12-04T12:46:35.5120477Z * [new branch] gh/ydwu4/312/base -> origin/gh/ydwu4/312/base 2025-12-04T12:46:35.5120542Z * [new branch] gh/ydwu4/312/head -> origin/gh/ydwu4/312/head 2025-12-04T12:46:35.5120608Z * [new branch] gh/ydwu4/312/orig -> origin/gh/ydwu4/312/orig 2025-12-04T12:46:35.5120673Z * [new branch] gh/ydwu4/322/base -> origin/gh/ydwu4/322/base 2025-12-04T12:46:35.5120738Z * [new branch] gh/ydwu4/322/head -> origin/gh/ydwu4/322/head 2025-12-04T12:46:35.5120804Z * [new branch] gh/ydwu4/322/orig -> origin/gh/ydwu4/322/orig 2025-12-04T12:46:35.5120869Z * [new branch] gh/ydwu4/327/base -> origin/gh/ydwu4/327/base 2025-12-04T12:46:35.5120934Z * [new branch] gh/ydwu4/327/head -> origin/gh/ydwu4/327/head 2025-12-04T12:46:35.5121020Z * [new branch] gh/ydwu4/327/orig -> origin/gh/ydwu4/327/orig 2025-12-04T12:46:35.5121086Z * [new branch] gh/ydwu4/328/base -> origin/gh/ydwu4/328/base 2025-12-04T12:46:35.5121151Z * [new branch] gh/ydwu4/328/head -> origin/gh/ydwu4/328/head 2025-12-04T12:46:35.5121240Z * [new branch] gh/ydwu4/328/orig -> origin/gh/ydwu4/328/orig 2025-12-04T12:46:35.5121306Z * [new branch] gh/ydwu4/329/base -> origin/gh/ydwu4/329/base 2025-12-04T12:46:35.5121371Z * [new branch] gh/ydwu4/329/head -> origin/gh/ydwu4/329/head 2025-12-04T12:46:35.5121436Z * [new branch] gh/ydwu4/329/orig -> origin/gh/ydwu4/329/orig 2025-12-04T12:46:35.5121502Z * [new branch] gh/ydwu4/330/base -> origin/gh/ydwu4/330/base 2025-12-04T12:46:35.5121568Z * [new branch] gh/ydwu4/330/head -> origin/gh/ydwu4/330/head 2025-12-04T12:46:35.5121635Z * [new branch] gh/ydwu4/330/orig -> origin/gh/ydwu4/330/orig 2025-12-04T12:46:35.5121701Z * [new branch] gh/ydwu4/331/base -> origin/gh/ydwu4/331/base 
2025-12-04T12:46:35.5121766Z * [new branch] gh/ydwu4/331/head -> origin/gh/ydwu4/331/head 2025-12-04T12:46:35.5121831Z * [new branch] gh/ydwu4/331/orig -> origin/gh/ydwu4/331/orig 2025-12-04T12:46:35.5121898Z * [new branch] gh/ydwu4/332/base -> origin/gh/ydwu4/332/base 2025-12-04T12:46:35.5121963Z * [new branch] gh/ydwu4/332/head -> origin/gh/ydwu4/332/head 2025-12-04T12:46:35.5122028Z * [new branch] gh/ydwu4/332/orig -> origin/gh/ydwu4/332/orig 2025-12-04T12:46:35.5122093Z * [new branch] gh/ydwu4/333/base -> origin/gh/ydwu4/333/base 2025-12-04T12:46:35.5122158Z * [new branch] gh/ydwu4/333/head -> origin/gh/ydwu4/333/head 2025-12-04T12:46:35.5122225Z * [new branch] gh/ydwu4/333/orig -> origin/gh/ydwu4/333/orig 2025-12-04T12:46:35.5122289Z * [new branch] gh/ydwu4/334/base -> origin/gh/ydwu4/334/base 2025-12-04T12:46:35.5122354Z * [new branch] gh/ydwu4/334/head -> origin/gh/ydwu4/334/head 2025-12-04T12:46:35.5122420Z * [new branch] gh/ydwu4/334/orig -> origin/gh/ydwu4/334/orig 2025-12-04T12:46:35.5122486Z * [new branch] gh/ydwu4/335/base -> origin/gh/ydwu4/335/base 2025-12-04T12:46:35.5122551Z * [new branch] gh/ydwu4/335/head -> origin/gh/ydwu4/335/head 2025-12-04T12:46:35.5122616Z * [new branch] gh/ydwu4/335/orig -> origin/gh/ydwu4/335/orig 2025-12-04T12:46:35.5122681Z * [new branch] gh/ydwu4/337/base -> origin/gh/ydwu4/337/base 2025-12-04T12:46:35.5122746Z * [new branch] gh/ydwu4/337/head -> origin/gh/ydwu4/337/head 2025-12-04T12:46:35.5122815Z * [new branch] gh/ydwu4/337/orig -> origin/gh/ydwu4/337/orig 2025-12-04T12:46:35.5122880Z * [new branch] gh/ydwu4/339/base -> origin/gh/ydwu4/339/base 2025-12-04T12:46:35.5122945Z * [new branch] gh/ydwu4/339/head -> origin/gh/ydwu4/339/head 2025-12-04T12:46:35.5123011Z * [new branch] gh/ydwu4/339/orig -> origin/gh/ydwu4/339/orig 2025-12-04T12:46:35.5123077Z * [new branch] gh/yf225/133/base -> origin/gh/yf225/133/base 2025-12-04T12:46:35.5123141Z * [new branch] gh/yf225/133/head -> origin/gh/yf225/133/head 
2025-12-04T12:46:35.5123207Z * [new branch] gh/yf225/93/base -> origin/gh/yf225/93/base 2025-12-04T12:46:35.5123272Z * [new branch] gh/yf225/93/head -> origin/gh/yf225/93/head 2025-12-04T12:46:35.5123345Z * [new branch] gh/yifuwang/152/base -> origin/gh/yifuwang/152/base 2025-12-04T12:46:35.5123436Z * [new branch] gh/yifuwang/152/head -> origin/gh/yifuwang/152/head 2025-12-04T12:46:35.5123508Z * [new branch] gh/yifuwang/152/orig -> origin/gh/yifuwang/152/orig 2025-12-04T12:46:35.5123579Z * [new branch] gh/yifuwang/195/base -> origin/gh/yifuwang/195/base 2025-12-04T12:46:35.5123649Z * [new branch] gh/yifuwang/195/head -> origin/gh/yifuwang/195/head 2025-12-04T12:46:35.5123751Z * [new branch] gh/yifuwang/195/orig -> origin/gh/yifuwang/195/orig 2025-12-04T12:46:35.5123823Z * [new branch] gh/yiming0416/1/base -> origin/gh/yiming0416/1/base 2025-12-04T12:46:35.5123893Z * [new branch] gh/yiming0416/1/head -> origin/gh/yiming0416/1/head 2025-12-04T12:46:35.5123962Z * [new branch] gh/yiming0416/2/base -> origin/gh/yiming0416/2/base 2025-12-04T12:46:35.5124033Z * [new branch] gh/yiming0416/2/head -> origin/gh/yiming0416/2/head 2025-12-04T12:46:35.5124106Z * [new branch] gh/yushangdi/1/base -> origin/gh/yushangdi/1/base 2025-12-04T12:46:35.5124178Z * [new branch] gh/yushangdi/1/head -> origin/gh/yushangdi/1/head 2025-12-04T12:46:35.5124250Z * [new branch] gh/yushangdi/10/base -> origin/gh/yushangdi/10/base 2025-12-04T12:46:35.5124320Z * [new branch] gh/yushangdi/10/head -> origin/gh/yushangdi/10/head 2025-12-04T12:46:35.5124393Z * [new branch] gh/yushangdi/10/orig -> origin/gh/yushangdi/10/orig 2025-12-04T12:46:35.5124463Z * [new branch] gh/yushangdi/11/base -> origin/gh/yushangdi/11/base 2025-12-04T12:46:35.5124533Z * [new branch] gh/yushangdi/11/head -> origin/gh/yushangdi/11/head 2025-12-04T12:46:35.5124604Z * [new branch] gh/yushangdi/11/orig -> origin/gh/yushangdi/11/orig 2025-12-04T12:46:35.5124674Z * [new branch] gh/yushangdi/2/base -> origin/gh/yushangdi/2/base 
2025-12-04T12:46:35.5124745Z * [new branch] gh/yushangdi/2/head -> origin/gh/yushangdi/2/head 2025-12-04T12:46:35.5124815Z * [new branch] gh/yushangdi/7/base -> origin/gh/yushangdi/7/base 2025-12-04T12:46:35.5124885Z * [new branch] gh/yushangdi/7/head -> origin/gh/yushangdi/7/head 2025-12-04T12:46:35.5124954Z * [new branch] gh/yushangdi/7/orig -> origin/gh/yushangdi/7/orig 2025-12-04T12:46:35.5125025Z * [new branch] gh/yushangdi/8/base -> origin/gh/yushangdi/8/base 2025-12-04T12:46:35.5125095Z * [new branch] gh/yushangdi/8/head -> origin/gh/yushangdi/8/head 2025-12-04T12:46:35.5125164Z * [new branch] gh/yushangdi/8/orig -> origin/gh/yushangdi/8/orig 2025-12-04T12:46:35.5125234Z * [new branch] gh/yushangdi/9/base -> origin/gh/yushangdi/9/base 2025-12-04T12:46:35.5125303Z * [new branch] gh/yushangdi/9/head -> origin/gh/yushangdi/9/head 2025-12-04T12:46:35.5125373Z * [new branch] gh/yushangdi/9/orig -> origin/gh/yushangdi/9/orig 2025-12-04T12:46:35.5125443Z * [new branch] gh/zklaus/19/base -> origin/gh/zklaus/19/base 2025-12-04T12:46:35.5125510Z * [new branch] gh/zklaus/19/head -> origin/gh/zklaus/19/head 2025-12-04T12:46:35.5125576Z * [new branch] gh/zklaus/19/orig -> origin/gh/zklaus/19/orig 2025-12-04T12:46:35.5125644Z * [new branch] gh/zklaus/20/base -> origin/gh/zklaus/20/base 2025-12-04T12:46:35.5125710Z * [new branch] gh/zklaus/20/head -> origin/gh/zklaus/20/head 2025-12-04T12:46:35.5125776Z * [new branch] gh/zklaus/20/orig -> origin/gh/zklaus/20/orig 2025-12-04T12:46:35.5125843Z * [new branch] gh/zklaus/21/base -> origin/gh/zklaus/21/base 2025-12-04T12:46:35.5125908Z * [new branch] gh/zklaus/21/head -> origin/gh/zklaus/21/head 2025-12-04T12:46:35.5125996Z * [new branch] gh/zklaus/21/orig -> origin/gh/zklaus/21/orig 2025-12-04T12:46:35.5126064Z * [new branch] gh/zklaus/22/base -> origin/gh/zklaus/22/base 2025-12-04T12:46:35.5126130Z * [new branch] gh/zklaus/22/head -> origin/gh/zklaus/22/head 2025-12-04T12:46:35.5126195Z * [new branch] gh/zklaus/22/orig -> 
origin/gh/zklaus/22/orig 2025-12-04T12:46:35.5126287Z * [new branch] gh/zklaus/23/base -> origin/gh/zklaus/23/base 2025-12-04T12:46:35.5126352Z * [new branch] gh/zklaus/23/head -> origin/gh/zklaus/23/head 2025-12-04T12:46:35.5126419Z * [new branch] gh/zklaus/23/orig -> origin/gh/zklaus/23/orig 2025-12-04T12:46:35.5126484Z * [new branch] gh/zklaus/24/base -> origin/gh/zklaus/24/base 2025-12-04T12:46:35.5126549Z * [new branch] gh/zklaus/24/head -> origin/gh/zklaus/24/head 2025-12-04T12:46:35.5126616Z * [new branch] gh/zklaus/24/orig -> origin/gh/zklaus/24/orig 2025-12-04T12:46:35.5126686Z * [new branch] gh/zou3519/1197/base -> origin/gh/zou3519/1197/base 2025-12-04T12:46:35.5126756Z * [new branch] gh/zou3519/1197/head -> origin/gh/zou3519/1197/head 2025-12-04T12:46:35.5126825Z * [new branch] gh/zou3519/1197/orig -> origin/gh/zou3519/1197/orig 2025-12-04T12:46:35.5126894Z * [new branch] gh/zou3519/1199/base -> origin/gh/zou3519/1199/base 2025-12-04T12:46:35.5126962Z * [new branch] gh/zou3519/1199/head -> origin/gh/zou3519/1199/head 2025-12-04T12:46:35.5127030Z * [new branch] gh/zou3519/1199/orig -> origin/gh/zou3519/1199/orig 2025-12-04T12:46:35.5127097Z * [new branch] gh/zou3519/1200/base -> origin/gh/zou3519/1200/base 2025-12-04T12:46:35.5127164Z * [new branch] gh/zou3519/1200/head -> origin/gh/zou3519/1200/head 2025-12-04T12:46:35.5127234Z * [new branch] gh/zou3519/1200/orig -> origin/gh/zou3519/1200/orig 2025-12-04T12:46:35.5127302Z * [new branch] gh/zou3519/1201/base -> origin/gh/zou3519/1201/base 2025-12-04T12:46:35.5127369Z * [new branch] gh/zou3519/1201/head -> origin/gh/zou3519/1201/head 2025-12-04T12:46:35.5127438Z * [new branch] gh/zou3519/1201/orig -> origin/gh/zou3519/1201/orig 2025-12-04T12:46:35.5127547Z * [new branch] gh/zou3519/1202/base -> origin/gh/zou3519/1202/base 2025-12-04T12:46:35.5127616Z * [new branch] gh/zou3519/1202/head -> origin/gh/zou3519/1202/head 2025-12-04T12:46:35.5127685Z * [new branch] gh/zou3519/1202/orig -> 
origin/gh/zou3519/1202/orig 2025-12-04T12:46:35.5127752Z * [new branch] gh/zpcore/1/base -> origin/gh/zpcore/1/base 2025-12-04T12:46:35.5127820Z * [new branch] gh/zpcore/1/head -> origin/gh/zpcore/1/head 2025-12-04T12:46:35.5127888Z * [new branch] gh/zpcore/11/base -> origin/gh/zpcore/11/base 2025-12-04T12:46:35.5127955Z * [new branch] gh/zpcore/11/head -> origin/gh/zpcore/11/head 2025-12-04T12:46:35.5128022Z * [new branch] gh/zpcore/11/orig -> origin/gh/zpcore/11/orig 2025-12-04T12:46:35.5128088Z * [new branch] gh/zpcore/12/base -> origin/gh/zpcore/12/base 2025-12-04T12:46:35.5128157Z * [new branch] gh/zpcore/12/head -> origin/gh/zpcore/12/head 2025-12-04T12:46:35.5128223Z * [new branch] gh/zpcore/12/orig -> origin/gh/zpcore/12/orig 2025-12-04T12:46:35.5128289Z * [new branch] gh/zpcore/13/base -> origin/gh/zpcore/13/base 2025-12-04T12:46:35.5128355Z * [new branch] gh/zpcore/13/head -> origin/gh/zpcore/13/head 2025-12-04T12:46:35.5128421Z * [new branch] gh/zpcore/13/orig -> origin/gh/zpcore/13/orig 2025-12-04T12:46:35.5128487Z * [new branch] gh/zpcore/14/base -> origin/gh/zpcore/14/base 2025-12-04T12:46:35.5128587Z * [new branch] gh/zpcore/14/head -> origin/gh/zpcore/14/head 2025-12-04T12:46:35.5128655Z * [new branch] gh/zpcore/14/orig -> origin/gh/zpcore/14/orig 2025-12-04T12:46:35.5128722Z * [new branch] gh/zpcore/15/base -> origin/gh/zpcore/15/base 2025-12-04T12:46:35.5128816Z * [new branch] gh/zpcore/15/head -> origin/gh/zpcore/15/head 2025-12-04T12:46:35.5128884Z * [new branch] gh/zpcore/15/orig -> origin/gh/zpcore/15/orig 2025-12-04T12:46:35.5128950Z * [new branch] gh/zpcore/2/base -> origin/gh/zpcore/2/base 2025-12-04T12:46:35.5129015Z * [new branch] gh/zpcore/2/head -> origin/gh/zpcore/2/head 2025-12-04T12:46:35.5129083Z * [new branch] gh/zpcore/21/base -> origin/gh/zpcore/21/base 2025-12-04T12:46:35.5129149Z * [new branch] gh/zpcore/21/head -> origin/gh/zpcore/21/head 2025-12-04T12:46:35.5129215Z * [new branch] gh/zpcore/21/orig -> origin/gh/zpcore/21/orig 
2025-12-04T12:46:35.5129282Z * [new branch] gh/zpcore/22/base -> origin/gh/zpcore/22/base 2025-12-04T12:46:35.5129348Z * [new branch] gh/zpcore/22/head -> origin/gh/zpcore/22/head 2025-12-04T12:46:35.5129417Z * [new branch] gh/zpcore/22/orig -> origin/gh/zpcore/22/orig 2025-12-04T12:46:35.5129482Z * [new branch] gh/zpcore/23/base -> origin/gh/zpcore/23/base 2025-12-04T12:46:35.5129548Z * [new branch] gh/zpcore/23/head -> origin/gh/zpcore/23/head 2025-12-04T12:46:35.5129615Z * [new branch] gh/zpcore/23/orig -> origin/gh/zpcore/23/orig 2025-12-04T12:46:35.5129680Z * [new branch] gh/zpcore/24/base -> origin/gh/zpcore/24/base 2025-12-04T12:46:35.5129746Z * [new branch] gh/zpcore/24/head -> origin/gh/zpcore/24/head 2025-12-04T12:46:35.5129814Z * [new branch] gh/zpcore/24/orig -> origin/gh/zpcore/24/orig 2025-12-04T12:46:35.5129880Z * [new branch] gh/zpcore/25/base -> origin/gh/zpcore/25/base 2025-12-04T12:46:35.5129946Z * [new branch] gh/zpcore/25/head -> origin/gh/zpcore/25/head 2025-12-04T12:46:35.5130014Z * [new branch] gh/zpcore/25/orig -> origin/gh/zpcore/25/orig 2025-12-04T12:46:35.5130080Z * [new branch] gh/zpcore/26/base -> origin/gh/zpcore/26/base 2025-12-04T12:46:35.5130146Z * [new branch] gh/zpcore/26/head -> origin/gh/zpcore/26/head 2025-12-04T12:46:35.5130213Z * [new branch] gh/zpcore/26/orig -> origin/gh/zpcore/26/orig 2025-12-04T12:46:35.5130279Z * [new branch] gh/zpcore/27/base -> origin/gh/zpcore/27/base 2025-12-04T12:46:35.5130345Z * [new branch] gh/zpcore/27/head -> origin/gh/zpcore/27/head 2025-12-04T12:46:35.5130413Z * [new branch] gh/zpcore/27/orig -> origin/gh/zpcore/27/orig 2025-12-04T12:46:35.5130479Z * [new branch] gh/zpcore/28/base -> origin/gh/zpcore/28/base 2025-12-04T12:46:35.5130545Z * [new branch] gh/zpcore/28/head -> origin/gh/zpcore/28/head 2025-12-04T12:46:35.5130612Z * [new branch] gh/zpcore/28/orig -> origin/gh/zpcore/28/orig 2025-12-04T12:46:35.5130679Z * [new branch] gh/zpcore/3/base -> origin/gh/zpcore/3/base 
2025-12-04T12:46:35.5130744Z * [new branch] gh/zpcore/3/head -> origin/gh/zpcore/3/head 2025-12-04T12:46:35.5130811Z * [new branch] gh/zpcore/4/base -> origin/gh/zpcore/4/base 2025-12-04T12:46:35.5130876Z * [new branch] gh/zpcore/4/head -> origin/gh/zpcore/4/head 2025-12-04T12:46:35.5130941Z * [new branch] gh/zpcore/5/base -> origin/gh/zpcore/5/base 2025-12-04T12:46:35.5131032Z * [new branch] gh/zpcore/5/head -> origin/gh/zpcore/5/head 2025-12-04T12:46:35.5131098Z * [new branch] gh/zpcore/6/base -> origin/gh/zpcore/6/base 2025-12-04T12:46:35.5131164Z * [new branch] gh/zpcore/6/head -> origin/gh/zpcore/6/head 2025-12-04T12:46:35.5131229Z * [new branch] gh/zpcore/7/base -> origin/gh/zpcore/7/base 2025-12-04T12:46:35.5131319Z * [new branch] gh/zpcore/7/head -> origin/gh/zpcore/7/head 2025-12-04T12:46:35.5131387Z * [new branch] gh/zpcore/8/base -> origin/gh/zpcore/8/base 2025-12-04T12:46:35.5131452Z * [new branch] gh/zpcore/8/head -> origin/gh/zpcore/8/head 2025-12-04T12:46:35.5131520Z * [new branch] google-main -> origin/google-main 2025-12-04T12:46:35.5131610Z * [new branch] guangyey/external_stream -> origin/guangyey/external_stream 2025-12-04T12:46:35.5131682Z * [new branch] guangyey/test_2025 -> origin/guangyey/test_2025 2025-12-04T12:46:35.5131821Z * [new branch] guilhermeleobas/cherry-pick-55d87d9dfd9 -> origin/guilhermeleobas/cherry-pick-55d87d9dfd9 2025-12-04T12:46:35.5131938Z * [new branch] hameerabbasi/complex_tensor_subclass -> origin/hameerabbasi/complex_tensor_subclass 2025-12-04T12:46:35.5132080Z * [new branch] hameerabbasi/fix-ctensor-gradcheck-tests -> origin/hameerabbasi/fix-ctensor-gradcheck-tests 2025-12-04T12:46:35.5132188Z * [new branch] hameerabbasi/gradcheck-allclose -> origin/hameerabbasi/gradcheck-allclose 2025-12-04T12:46:35.5132252Z * [new branch] hc_baseline -> origin/hc_baseline 2025-12-04T12:46:35.5132314Z * [new branch] hhh_rand -> origin/hhh_rand 2025-12-04T12:46:35.5132377Z * [new branch] huba/f1 -> origin/huba/f1 
2025-12-04T12:46:35.5132566Z * [new branch] increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test -> origin/increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test 2025-12-04T12:46:35.5132627Z * [new branch] inlining -> origin/inlining 2025-12-04T12:46:35.5132698Z * [new branch] inlining-ezyang -> origin/inlining-ezyang 2025-12-04T12:46:35.5132781Z * [new branch] install-torchao-0.13.0 -> origin/install-torchao-0.13.0 2025-12-04T12:46:35.5132960Z * [new branch] instrument-trunk-pull-linux-with-job-test-filters -> origin/instrument-trunk-pull-linux-with-job-test-filters 2025-12-04T12:46:35.5133032Z * [new branch] invoke-subgraph -> origin/invoke-subgraph 2025-12-04T12:46:35.5133096Z * [new branch] issue#58739 -> origin/issue#58739 2025-12-04T12:46:35.5133174Z * [new branch] jainapurva-patch-1 -> origin/jainapurva-patch-1 2025-12-04T12:46:35.5133235Z * [new branch] jathu/o3 -> origin/jathu/o3 2025-12-04T12:46:35.5133297Z * [new branch] jathu/sve -> origin/jathu/sve 2025-12-04T12:46:35.5133421Z * [new branch] jcaip/test-cusparselt-version-0.6.2 -> origin/jcaip/test-cusparselt-version-0.6.2 2025-12-04T12:46:35.5133527Z * [new branch] jcaip/update-cusparselt-0.6.2 -> origin/jcaip/update-cusparselt-0.6.2 2025-12-04T12:46:35.5133640Z * [new branch] jiannanWang/memorysnapshot_filter -> origin/jiannanWang/memorysnapshot_filter 2025-12-04T12:46:35.5133749Z * [new branch] jiannanWang/profilerstepwarning -> origin/jiannanWang/profilerstepwarning 2025-12-04T12:46:35.5133835Z * [new branch] jithunnair-amd-patch-1 -> origin/jithunnair-amd-patch-1 2025-12-04T12:46:35.5133920Z * [new branch] jithunnair-amd-patch-10 -> origin/jithunnair-amd-patch-10 2025-12-04T12:46:35.5134002Z * [new branch] jithunnair-amd-patch-2 -> origin/jithunnair-amd-patch-2 2025-12-04T12:46:35.5134105Z * [new branch] jithunnair-amd-patch-3 -> origin/jithunnair-amd-patch-3 2025-12-04T12:46:35.5134185Z * [new branch] jithunnair-amd-patch-4 -> origin/jithunnair-amd-patch-4 2025-12-04T12:46:35.5134263Z * [new 
branch] jithunnair-amd-patch-5 -> origin/jithunnair-amd-patch-5 2025-12-04T12:46:35.5134374Z * [new branch] jithunnair-amd-patch-6 -> origin/jithunnair-amd-patch-6 2025-12-04T12:46:35.5134452Z * [new branch] jithunnair-amd-patch-7 -> origin/jithunnair-amd-patch-7 2025-12-04T12:46:35.5134531Z * [new branch] jithunnair-amd-patch-8 -> origin/jithunnair-amd-patch-8 2025-12-04T12:46:35.5134609Z * [new branch] jithunnair-amd-patch-9 -> origin/jithunnair-amd-patch-9 2025-12-04T12:46:35.5134685Z * [new branch] justinchu/native-qdq -> origin/justinchu/native-qdq 2025-12-04T12:46:35.5134758Z * [new branch] kainan666/xlf_debug -> origin/kainan666/xlf_debug 2025-12-04T12:46:35.5134821Z * [new branch] kainan_test -> origin/kainan_test 2025-12-04T12:46:35.5134898Z * [new branch] larryliu0820-patch-1 -> origin/larryliu0820-patch-1 2025-12-04T12:46:35.5135003Z * [new branch] leslie/test_group_gemm_epilogues -> origin/leslie/test_group_gemm_epilogues 2025-12-04T12:46:35.5135109Z * [new branch] lessw2020/fix_cutlass_cache_error -> origin/lessw2020/fix_cutlass_cache_error 2025-12-04T12:46:35.5135186Z * [new branch] liaoxuan/shm_all_reduce -> origin/liaoxuan/shm_all_reduce 2025-12-04T12:46:35.5135289Z * [new branch] liaoxuan/test_fa_disable_softmax -> origin/liaoxuan/test_fa_disable_softmax 2025-12-04T12:46:35.5135367Z * [new branch] liaoxuan/test_int8_sdpa -> origin/liaoxuan/test_int8_sdpa 2025-12-04T12:46:35.5135435Z * [new branch] llama4-stable -> origin/llama4-stable 2025-12-04T12:46:35.5135504Z * [new branch] lts/release/1.8 -> origin/lts/release/1.8 2025-12-04T12:46:35.5135576Z * [new branch] lucaskabela/#94773 -> origin/lucaskabela/#94773 2025-12-04T12:46:35.5135651Z * [new branch] lucaskabela/fix_164876 -> origin/lucaskabela/fix_164876 2025-12-04T12:46:35.5135735Z * [new branch] lucaskabela/flop_counter -> origin/lucaskabela/flop_counter 2025-12-04T12:46:35.5135832Z * [new branch] lucaskabela/func_under_decomp -> origin/lucaskabela/func_under_decomp 
2025-12-04T12:46:35.5135937Z * [new branch] lucaskabela/functional_in_dynamo -> origin/lucaskabela/functional_in_dynamo 2025-12-04T12:46:35.5136062Z * [new branch] lucaskabela/install_params_as_graph_attr -> origin/lucaskabela/install_params_as_graph_attr 2025-12-04T12:46:35.5136175Z * [new branch] lucaskabela/parameters_as_graph_attr -> origin/lucaskabela/parameters_as_graph_attr 2025-12-04T12:46:35.5136309Z * [new branch] lucaskabela/remove_aot_dispatcher_metadata -> origin/lucaskabela/remove_aot_dispatcher_metadata 2025-12-04T12:46:35.5136387Z * [new branch] lucaskabela/rnn_decomp -> origin/lucaskabela/rnn_decomp 2025-12-04T12:46:35.5136479Z * [new branch] lucaskabela/typing_backends -> origin/lucaskabela/typing_backends 2025-12-04T12:46:35.5136579Z * [new branch] lucaskabela/typing_ctx_manager -> origin/lucaskabela/typing_ctx_manager 2025-12-04T12:46:35.5136672Z * [new branch] lucaskabela/typing_nn_module -> origin/lucaskabela/typing_nn_module 2025-12-04T12:46:35.5136773Z * [new branch] lucaskabela/typing_user_defined -> origin/lucaskabela/typing_user_defined 2025-12-04T12:46:35.5136869Z * [new branch] lucaskabela/typing_variables -> origin/lucaskabela/typing_variables 2025-12-04T12:46:35.5136978Z * [new branch] lucaskabela/typing_variables_dicts -> origin/lucaskabela/typing_variables_dicts 2025-12-04T12:46:35.5137122Z * [new branch] lucaskabela/typing_variables_functions -> origin/lucaskabela/typing_variables_functions 2025-12-04T12:46:35.5137231Z * [new branch] lucaskabela/typing_variables_lists -> origin/lucaskabela/typing_variables_lists 2025-12-04T12:46:35.5137304Z * [new branch] lw/torch_box_by_ref -> origin/lw/torch_box_by_ref 2025-12-04T12:46:35.5137389Z * [new branch] main -> origin/main 2025-12-04T12:46:35.5137460Z * [new branch] malfet-patch-1 -> origin/malfet-patch-1 2025-12-04T12:46:35.5137572Z * [new branch] malfet-patch-2 -> origin/malfet-patch-2 2025-12-04T12:46:35.5137642Z * [new branch] malfet-patch-3 -> origin/malfet-patch-3 
2025-12-04T12:46:35.5137707Z * [new branch] malfet-patch-4 -> origin/malfet-patch-4 2025-12-04T12:46:35.5137773Z * [new branch] malfet-patch-5 -> origin/malfet-patch-5 2025-12-04T12:46:35.5137840Z * [new branch] malfet-patch-6 -> origin/malfet-patch-6 2025-12-04T12:46:35.5137905Z * [new branch] malfet-patch-7 -> origin/malfet-patch-7 2025-12-04T12:46:35.5137970Z * [new branch] malfet-patch-8 -> origin/malfet-patch-8 2025-12-04T12:46:35.5138045Z * [new branch] malfet/add-3.14-ci -> origin/malfet/add-3.14-ci 2025-12-04T12:46:35.5138208Z * [new branch] malfet/be-do-not-make-typos-in-build-artifacts -> origin/malfet/be-do-not-make-typos-in-build-artifacts 2025-12-04T12:46:35.5138374Z * [new branch] malfet/be-move-more-settings-to-checkout-pytorch -> origin/malfet/be-move-more-settings-to-checkout-pytorch 2025-12-04T12:46:35.5138502Z * [new branch] malfet/be-remove-misisng-neon-headers -> origin/malfet/be-remove-misisng-neon-headers 2025-12-04T12:46:35.5138602Z * [new branch] malfet/mps-implement-col2im -> origin/malfet/mps-implement-col2im 2025-12-04T12:46:35.5138718Z * [new branch] manuel/aoti_metal_shimify-thread_safe -> origin/manuel/aoti_metal_shimify-thread_safe 2025-12-04T12:46:35.5138810Z * [new branch] manuel/inductor_link_openmp -> origin/manuel/inductor_link_openmp 2025-12-04T12:46:35.5138883Z * [new branch] masnesral/metaconda -> origin/masnesral/metaconda 2025-12-04T12:46:35.5138960Z * [new branch] mem_profiler_flaky_fix -> origin/mem_profiler_flaky_fix 2025-12-04T12:46:35.5139040Z * [new branch] mem_profiler_stack_trace -> origin/mem_profiler_stack_trace 2025-12-04T12:46:35.5139114Z * [new branch] memory_profiler_stack -> origin/memory_profiler_stack 2025-12-04T12:46:35.5139189Z * [new branch] metascroy-patch-1 -> origin/metascroy-patch-1 2025-12-04T12:46:35.5139252Z * [new branch] mingw_posix -> origin/mingw_posix 2025-12-04T12:46:35.5139326Z * [new branch] mlazos/S429861-debug -> origin/mlazos/S429861-debug 2025-12-04T12:46:35.5139387Z * [new branch] 
mlazos/aa -> origin/mlazos/aa 2025-12-04T12:46:35.5139449Z * [new branch] mlazos/acts -> origin/mlazos/acts 2025-12-04T12:46:35.5139521Z * [new branch] mlazos/arg-renames -> origin/mlazos/arg-renames 2025-12-04T12:46:35.5139600Z * [new branch] mlazos/bad-cudagraphs -> origin/mlazos/bad-cudagraphs 2025-12-04T12:46:35.5139699Z * [new branch] mlazos/baseline-graph-breaks -> origin/mlazos/baseline-graph-breaks 2025-12-04T12:46:35.5139771Z * [new branch] mlazos/beta-tensor -> origin/mlazos/beta-tensor 2025-12-04T12:46:35.5139837Z * [new branch] mlazos/buffers -> origin/mlazos/buffers 2025-12-04T12:46:35.5139903Z * [new branch] mlazos/buffers2 -> origin/mlazos/buffers2 2025-12-04T12:46:35.5140004Z * [new branch] mlazos/buffers3 -> origin/mlazos/buffers3 2025-12-04T12:46:35.5140068Z * [new branch] mlazos/bwd -> origin/mlazos/bwd 2025-12-04T12:46:35.5140138Z * [new branch] mlazos/combo-test -> origin/mlazos/combo-test 2025-12-04T12:46:35.5140210Z * [new branch] mlazos/ctx-cleanup -> origin/mlazos/ctx-cleanup 2025-12-04T12:46:35.5140321Z * [new branch] mlazos/cuda-cmd-log -> origin/mlazos/cuda-cmd-log 2025-12-04T12:46:35.5140401Z * [new branch] mlazos/cudagraph-tests -> origin/mlazos/cudagraph-tests 2025-12-04T12:46:35.5140504Z * [new branch] mlazos/cudagraphs-measurement -> origin/mlazos/cudagraphs-measurement 2025-12-04T12:46:35.5140578Z * [new branch] mlazos/cutlass-test -> origin/mlazos/cutlass-test 2025-12-04T12:46:35.5140658Z * [new branch] mlazos/cutlass-topo-bug -> origin/mlazos/cutlass-topo-bug 2025-12-04T12:46:35.5140739Z * [new branch] mlazos/dataclass-proxy -> origin/mlazos/dataclass-proxy 2025-12-04T12:46:35.5140807Z * [new branch] mlazos/dc-attrs -> origin/mlazos/dc-attrs 2025-12-04T12:46:35.5140875Z * [new branch] mlazos/dc-helion -> origin/mlazos/dc-helion 2025-12-04T12:46:35.5140942Z * [new branch] mlazos/dict-fix -> origin/mlazos/dict-fix 2025-12-04T12:46:35.5141012Z * [new branch] mlazos/disable-tf -> origin/mlazos/disable-tf 2025-12-04T12:46:35.5141078Z 
* [new branch] mlazos/dupe-fix -> origin/mlazos/dupe-fix 2025-12-04T12:46:35.5141146Z * [new branch] mlazos/dyn-batch -> origin/mlazos/dyn-batch 2025-12-04T12:46:35.5141208Z * [new branch] mlazos/evt -> origin/mlazos/evt 2025-12-04T12:46:35.5141287Z * [new branch] mlazos/extract-examples -> origin/mlazos/extract-examples 2025-12-04T12:46:35.5141359Z * [new branch] mlazos/foreach-op -> origin/mlazos/foreach-op 2025-12-04T12:46:35.5141420Z * [new branch] mlazos/fp8 -> origin/mlazos/fp8 2025-12-04T12:46:35.5141487Z * [new branch] mlazos/fp8-bias -> origin/mlazos/fp8-bias 2025-12-04T12:46:35.5141567Z * [new branch] mlazos/fp8-bias-fusion -> origin/mlazos/fp8-bias-fusion 2025-12-04T12:46:35.5141636Z * [new branch] mlazos/fp8-fixes -> origin/mlazos/fp8-fixes 2025-12-04T12:46:35.5141702Z * [new branch] mlazos/freezing -> origin/mlazos/freezing 2025-12-04T12:46:35.5141769Z * [new branch] mlazos/h-comp -> origin/mlazos/h-comp 2025-12-04T12:46:35.5141836Z * [new branch] mlazos/h-comp2 -> origin/mlazos/h-comp2 2025-12-04T12:46:35.5141904Z * [new branch] mlazos/hash-hop -> origin/mlazos/hash-hop 2025-12-04T12:46:35.5141965Z * [new branch] mlazos/hc -> origin/mlazos/hc 2025-12-04T12:46:35.5142034Z * [new branch] mlazos/hc-cycles -> origin/mlazos/hc-cycles 2025-12-04T12:46:35.5142100Z * [new branch] mlazos/hc-fixes -> origin/mlazos/hc-fixes 2025-12-04T12:46:35.5142167Z * [new branch] mlazos/hc-fixes3 -> origin/mlazos/hc-fixes3 2025-12-04T12:46:35.5142234Z * [new branch] mlazos/hc-fixes4 -> origin/mlazos/hc-fixes4 2025-12-04T12:46:35.5142298Z * [new branch] mlazos/hc-hf -> origin/mlazos/hc-hf 2025-12-04T12:46:35.5142363Z * [new branch] mlazos/hc-mut -> origin/mlazos/hc-mut 2025-12-04T12:46:35.5142424Z * [new branch] mlazos/hc10 -> origin/mlazos/hc10 2025-12-04T12:46:35.5142486Z * [new branch] mlazos/hc11 -> origin/mlazos/hc11 2025-12-04T12:46:35.5142547Z * [new branch] mlazos/hc12 -> origin/mlazos/hc12 2025-12-04T12:46:35.5142631Z * [new branch] mlazos/hc13 -> origin/mlazos/hc13 
2025-12-04T12:46:35.5142692Z * [new branch] mlazos/hc14 -> origin/mlazos/hc14 2025-12-04T12:46:35.5142751Z * [new branch] mlazos/hc15 -> origin/mlazos/hc15 2025-12-04T12:46:35.5142812Z * [new branch] mlazos/hc2 -> origin/mlazos/hc2 2025-12-04T12:46:35.5142899Z * [new branch] mlazos/hc4 -> origin/mlazos/hc4 2025-12-04T12:46:35.5142959Z * [new branch] mlazos/hc5 -> origin/mlazos/hc5 2025-12-04T12:46:35.5143018Z * [new branch] mlazos/hc6 -> origin/mlazos/hc6 2025-12-04T12:46:35.5143078Z * [new branch] mlazos/hc7 -> origin/mlazos/hc7 2025-12-04T12:46:35.5143136Z * [new branch] mlazos/hc8 -> origin/mlazos/hc8 2025-12-04T12:46:35.5143195Z * [new branch] mlazos/hc9 -> origin/mlazos/hc9 2025-12-04T12:46:35.5143267Z * [new branch] mlazos/hc_baseline2 -> origin/mlazos/hc_baseline2 2025-12-04T12:46:35.5143349Z * [new branch] mlazos/inductor-streams -> origin/mlazos/inductor-streams 2025-12-04T12:46:35.5143410Z * [new branch] mlazos/main -> origin/mlazos/main 2025-12-04T12:46:35.5143472Z * [new branch] mlazos/mcg2 -> origin/mlazos/mcg2 2025-12-04T12:46:35.5143545Z * [new branch] mlazos/meta-guards -> origin/mlazos/meta-guards 2025-12-04T12:46:35.5143648Z * [new branch] mlazos/mlazos/foreach-map-adam -> origin/mlazos/mlazos/foreach-map-adam 2025-12-04T12:46:35.5143745Z * [new branch] mlazos/mlazos/tf-mode-backup -> origin/mlazos/mlazos/tf-mode-backup 2025-12-04T12:46:35.5143811Z * [new branch] mlazos/mod-fix -> origin/mlazos/mod-fix 2025-12-04T12:46:35.5143877Z * [new branch] mlazos/mode-fix -> origin/mlazos/mode-fix 2025-12-04T12:46:35.5143944Z * [new branch] mlazos/offsets -> origin/mlazos/offsets 2025-12-04T12:46:35.5144018Z * [new branch] mlazos/overguarding -> origin/mlazos/overguarding 2025-12-04T12:46:35.5144091Z * [new branch] mlazos/proxy-ctors -> origin/mlazos/proxy-ctors 2025-12-04T12:46:35.5144162Z * [new branch] mlazos/quant-fix -> origin/mlazos/quant-fix 2025-12-04T12:46:35.5144231Z * [new branch] mlazos/resnet-fix -> origin/mlazos/resnet-fix 
2025-12-04T12:46:35.5144302Z * [new branch] mlazos/rm-buf-names -> origin/mlazos/rm-buf-names 2025-12-04T12:46:35.5144368Z * [new branch] mlazos/rm-code -> origin/mlazos/rm-code 2025-12-04T12:46:35.5144433Z * [new branch] mlazos/rm-spam -> origin/mlazos/rm-spam 2025-12-04T12:46:35.5144495Z * [new branch] mlazos/rtp -> origin/mlazos/rtp 2025-12-04T12:46:35.5144573Z * [new branch] mlazos/static-idx-dbg -> origin/mlazos/static-idx-dbg 2025-12-04T12:46:35.5144659Z * [new branch] mlazos/static-inputs-log -> origin/mlazos/static-inputs-log 2025-12-04T12:46:35.5144724Z * [new branch] mlazos/stests -> origin/mlazos/stests 2025-12-04T12:46:35.5144794Z * [new branch] mlazos/stream-ops -> origin/mlazos/stream-ops 2025-12-04T12:46:35.5144859Z * [new branch] mlazos/td-fix2 -> origin/mlazos/td-fix2 2025-12-04T12:46:35.5144937Z * [new branch] mlazos/tensor-hasattr2 -> origin/mlazos/tensor-hasattr2 2025-12-04T12:46:35.5144998Z * [new branch] mlazos/test -> origin/mlazos/test 2025-12-04T12:46:35.5145063Z * [new branch] mlazos/tf-mode -> origin/mlazos/tf-mode 2025-12-04T12:46:35.5145142Z * [new branch] mlazos/tf-mode-backup2 -> origin/mlazos/tf-mode-backup2 2025-12-04T12:46:35.5145248Z * [new branch] mlazos/tf-mode-reland -> origin/mlazos/tf-mode-reland 2025-12-04T12:46:35.5145325Z * [new branch] mlazos/tf-mode-reland2 -> origin/mlazos/tf-mode-reland2 2025-12-04T12:46:35.5145401Z * [new branch] mlazos/tf-mode-reland3 -> origin/mlazos/tf-mode-reland3 2025-12-04T12:46:35.5145477Z * [new branch] mlazos/triton-no-epi -> origin/mlazos/triton-no-epi 2025-12-04T12:46:35.5145570Z * [new branch] mlazos/tune-proto -> origin/mlazos/tune-proto 2025-12-04T12:46:35.5145643Z * [new branch] mlazos/tuple-fixes -> origin/mlazos/tuple-fixes 2025-12-04T12:46:35.5145716Z * [new branch] mlazos/tuple-fixes2 -> origin/mlazos/tuple-fixes2 2025-12-04T12:46:35.5145792Z * [new branch] mlazos/tuple-handling -> origin/mlazos/tuple-handling 2025-12-04T12:46:35.5145872Z * [new branch] mlazos/user-stream-base -> 
origin/mlazos/user-stream-base 2025-12-04T12:46:35.5145945Z * [new branch] mlazos/user-streams -> origin/mlazos/user-streams 2025-12-04T12:46:35.5146037Z * [new branch] mlazos/user-streams-backup -> origin/mlazos/user-streams-backup 2025-12-04T12:46:35.5146134Z * [new branch] mlazos/user-streams-backup2 -> origin/mlazos/user-streams-backup2 2025-12-04T12:46:35.5146203Z * [new branch] mlazos/vary-beta -> origin/mlazos/vary-beta 2025-12-04T12:46:35.5146273Z * [new branch] mlazos/vary-beta2 -> origin/mlazos/vary-beta2 2025-12-04T12:46:35.5146345Z * [new branch] mlazos/weird-perf1 -> origin/mlazos/weird-perf1 2025-12-04T12:46:35.5146416Z * [new branch] mm_out_dtype_compile -> origin/mm_out_dtype_compile 2025-12-04T12:46:35.5146480Z * [new branch] module-shim -> origin/module-shim 2025-12-04T12:46:35.5146541Z * [new branch] move_config -> origin/move_config 2025-12-04T12:46:35.5146610Z * [new branch] msaroufim/reduce -> origin/msaroufim/reduce 2025-12-04T12:46:35.5146680Z * [new branch] mtia/basic-cmake -> origin/mtia/basic-cmake 2025-12-04T12:46:35.5146782Z * [new branch] mwizak/fix-triton-block-shape -> origin/mwizak/fix-triton-block-shape 2025-12-04T12:46:35.5146851Z * [new branch] my_varlen_backup -> origin/my_varlen_backup 2025-12-04T12:46:35.5146926Z * [new branch] nativert_num_outputs -> origin/nativert_num_outputs 2025-12-04T12:46:35.5146989Z * [new branch] new-codegen -> origin/new-codegen 2025-12-04T12:46:35.5147055Z * [new branch] newtest-base -> origin/newtest-base 2025-12-04T12:46:35.5147126Z * [new branch] ngimel/addmm_dtype -> origin/ngimel/addmm_dtype 2025-12-04T12:46:35.5147190Z * [new branch] ngimel/div_inv -> origin/ngimel/div_inv 2025-12-04T12:46:35.5147269Z * [new branch] ngimel/error_index_list -> origin/ngimel/error_index_list 2025-12-04T12:46:35.5147341Z * [new branch] ngimel/gather_grid -> origin/ngimel/gather_grid 2025-12-04T12:46:35.5147429Z * [new branch] ngimel/gather_grid_release -> origin/ngimel/gather_grid_release 
2025-12-04T12:46:35.5147531Z * [new branch] ngimel/gg_new -> origin/ngimel/gg_new 2025-12-04T12:46:35.5147601Z * [new branch] ngimel/hostalloc -> origin/ngimel/hostalloc 2025-12-04T12:46:35.5147670Z * [new branch] ngimel/storage_id -> origin/ngimel/storage_id 2025-12-04T12:46:35.5147731Z * [new branch] nightly -> origin/nightly 2025-12-04T12:46:35.5147850Z * [new branch] nikitaved/addmm_1_rowcol_lt_path_check -> origin/nikitaved/addmm_1_rowcol_lt_path_check 2025-12-04T12:46:35.5148011Z * [new branch] nikitaved/addmm_epilogue_fusions_2d_bias -> origin/nikitaved/addmm_epilogue_fusions_2d_bias 2025-12-04T12:46:35.5148139Z * [new branch] nikitaved/addmm_epilogue_fusions_inductor -> origin/nikitaved/addmm_epilogue_fusions_inductor 2025-12-04T12:46:35.5148261Z * [new branch] nikitaved/addmm_epilogue_fusions_scratch -> origin/nikitaved/addmm_epilogue_fusions_scratch 2025-12-04T12:46:35.5148417Z * [new branch] nikitaved/grad_addmm_epilogue_fusions -> origin/nikitaved/grad_addmm_epilogue_fusions 2025-12-04T12:46:35.5148531Z * [new branch] nikitaved/simpler_can_use_32bit_index -> origin/nikitaved/simpler_can_use_32bit_index 2025-12-04T12:46:35.5148598Z * [new branch] nikitaved/test -> origin/nikitaved/test 2025-12-04T12:46:35.5148723Z * [new branch] nmacchioni-perf-test-async-autotune -> origin/nmacchioni-perf-test-async-autotune 2025-12-04T12:46:35.5148801Z * [new branch] no_distributed_log_spew -> origin/no_distributed_log_spew 2025-12-04T12:46:35.5148866Z * [new branch] nofun-hack -> origin/nofun-hack 2025-12-04T12:46:35.5148928Z * [new branch] norm_bench -> origin/norm_bench 2025-12-04T12:46:35.5149003Z * [new branch] nullplay/fuse_matmul -> origin/nullplay/fuse_matmul 2025-12-04T12:46:35.5149076Z * [new branch] nullplay_fuse_matmul -> origin/nullplay_fuse_matmul 2025-12-04T12:46:35.5149143Z * [new branch] optimizer_test -> origin/optimizer_test 2025-12-04T12:46:35.5149212Z * [new branch] orig/release/1.10 -> origin/orig/release/1.10 2025-12-04T12:46:35.5149279Z * [new 
branch] orig/release/1.11 -> origin/orig/release/1.11 2025-12-04T12:46:35.5149345Z * [new branch] orig/release/1.12 -> origin/orig/release/1.12 2025-12-04T12:46:35.5149413Z * [new branch] orig/release/1.13 -> origin/orig/release/1.13 2025-12-04T12:46:35.5149479Z * [new branch] orig/release/1.6 -> origin/orig/release/1.6 2025-12-04T12:46:35.5149545Z * [new branch] orig/release/1.7 -> origin/orig/release/1.7 2025-12-04T12:46:35.5149610Z * [new branch] orig/release/1.8 -> origin/orig/release/1.8 2025-12-04T12:46:35.5149675Z * [new branch] orig/release/1.9 -> origin/orig/release/1.9 2025-12-04T12:46:35.5149741Z * [new branch] orig/release/2.0 -> origin/orig/release/2.0 2025-12-04T12:46:35.5149806Z * [new branch] orig/release/2.1 -> origin/orig/release/2.1 2025-12-04T12:46:35.5149869Z * [new branch] orig/release/2.2 -> origin/orig/release/2.2 2025-12-04T12:46:35.5149934Z * [new branch] orig/release/2.3 -> origin/orig/release/2.3 2025-12-04T12:46:35.5149998Z * [new branch] orig/release/2.4 -> origin/orig/release/2.4 2025-12-04T12:46:35.5150063Z * [new branch] orig/release/2.5 -> origin/orig/release/2.5 2025-12-04T12:46:35.5150128Z * [new branch] orig/release/2.6 -> origin/orig/release/2.6 2025-12-04T12:46:35.5150192Z * [new branch] orig/release/2.7 -> origin/orig/release/2.7 2025-12-04T12:46:35.5150256Z * [new branch] orig/release/2.8 -> origin/orig/release/2.8 2025-12-04T12:46:35.5150323Z * [new branch] orig/release/2.9 -> origin/orig/release/2.9 2025-12-04T12:46:35.5150407Z * [new branch] origin/gh/fxdawnn/1/base -> origin/origin/gh/fxdawnn/1/base 2025-12-04T12:46:35.5150490Z * [new branch] origin/gh/fxdawnn/1/orig -> origin/origin/gh/fxdawnn/1/orig 2025-12-04T12:46:35.5150573Z * [new branch] origin/gh/zpcore/14/orig -> origin/origin/gh/zpcore/14/orig 2025-12-04T12:46:35.5150641Z * [new branch] oulgen-patch-1 -> origin/oulgen-patch-1 2025-12-04T12:46:35.5150731Z * [new branch] oulgen-patch-2 -> origin/oulgen-patch-2 2025-12-04T12:46:35.5150799Z * [new branch] 
oulgen-patch-3 -> origin/oulgen-patch-3 2025-12-04T12:46:35.5150865Z * [new branch] oulgen-patch-4 -> origin/oulgen-patch-4 2025-12-04T12:46:35.5150932Z * [new branch] padded-tensor -> origin/padded-tensor 2025-12-04T12:46:35.5151016Z * [new branch] pca2 -> origin/pca2 2025-12-04T12:46:35.5151089Z * [new branch] per_channel_backup -> origin/per_channel_backup 2025-12-04T12:46:35.5151152Z * [new branch] perf_ops -> origin/perf_ops 2025-12-04T12:46:35.5151215Z * [new branch] perf_ops_2_9 -> origin/perf_ops_2_9 2025-12-04T12:46:35.5151286Z * [new branch] pianpwk-patch-1 -> origin/pianpwk-patch-1 2025-12-04T12:46:35.5151373Z * [new branch] pianpwk/__draft_debug_mode -> origin/pianpwk/__draft_debug_mode 2025-12-04T12:46:35.5151484Z * [new branch] pianpwk/_debug_mode_for_triton_draft -> origin/pianpwk/_debug_mode_for_triton_draft 2025-12-04T12:46:35.5151585Z * [new branch] pianpwk/_debug_nn_module_compile -> origin/pianpwk/_debug_nn_module_compile 2025-12-04T12:46:35.5151672Z * [new branch] pianpwk/_draft_triton_11_3 -> origin/pianpwk/_draft_triton_11_3 2025-12-04T12:46:35.5151764Z * [new branch] pianpwk/_manual_bucket_draft -> origin/pianpwk/_manual_bucket_draft 2025-12-04T12:46:35.5151866Z * [new branch] pianpwk/_profile_w_dispatch_keys -> origin/pianpwk/_profile_w_dispatch_keys 2025-12-04T12:46:35.5151964Z * [new branch] pianpwk/_super_draft_debug_mode -> origin/pianpwk/_super_draft_debug_mode 2025-12-04T12:46:35.5152069Z * [new branch] pianpwk/_unbacked_local_shard_size -> origin/pianpwk/_unbacked_local_shard_size 2025-12-04T12:46:35.5152143Z * [new branch] pianpwk/anomaly_tb -> origin/pianpwk/anomaly_tb 2025-12-04T12:46:35.5152225Z * [new branch] pianpwk/auto_fx_annotate -> origin/pianpwk/auto_fx_annotate 2025-12-04T12:46:35.5152337Z * [new branch] pianpwk/backed_size_oblivious_export -> origin/pianpwk/backed_size_oblivious_export 2025-12-04T12:46:35.5152422Z * [new branch] pianpwk/bert_dynamic_perf -> origin/pianpwk/bert_dynamic_perf 2025-12-04T12:46:35.5152520Z * 
[new branch] pianpwk/debug_fwd_stack_traces -> origin/pianpwk/debug_fwd_stack_traces 2025-12-04T12:46:35.5152605Z * [new branch] pianpwk/debug_hash_tensor -> origin/pianpwk/debug_hash_tensor 2025-12-04T12:46:35.5152696Z * [new branch] pianpwk/debug_mode_annotate -> origin/pianpwk/debug_mode_annotate 2025-12-04T12:46:35.5152785Z * [new branch] pianpwk/debug_mode_defaults -> origin/pianpwk/debug_mode_defaults 2025-12-04T12:46:35.5152865Z * [new branch] pianpwk/debug_mode_hacks -> origin/pianpwk/debug_mode_hacks 2025-12-04T12:46:35.5152973Z * [new branch] pianpwk/debug_mode_opcall_refactor -> origin/pianpwk/debug_mode_opcall_refactor 2025-12-04T12:46:35.5153061Z * [new branch] pianpwk/debug_mode_show_ids -> origin/pianpwk/debug_mode_show_ids 2025-12-04T12:46:35.5153143Z * [new branch] pianpwk/debug_mode_triton -> origin/pianpwk/debug_mode_triton 2025-12-04T12:46:35.5153241Z * [new branch] pianpwk/debug_show_stack_trace -> origin/pianpwk/debug_show_stack_trace 2025-12-04T12:46:35.5153341Z * [new branch] pianpwk/debug_wait_on_collective -> origin/pianpwk/debug_wait_on_collective 2025-12-04T12:46:35.5153437Z * [new branch] pianpwk/debugmode_compile_tf -> origin/pianpwk/debugmode_compile_tf 2025-12-04T12:46:35.5153563Z * [new branch] pianpwk/dispatch_key_debugging_for_debug -> origin/pianpwk/dispatch_key_debugging_for_debug 2025-12-04T12:46:35.5153691Z * [new branch] pianpwk/draft_debug_mode_tfcompile -> origin/pianpwk/draft_debug_mode_tfcompile 2025-12-04T12:46:35.5153786Z * [new branch] pianpwk/draft_multikernel_nn -> origin/pianpwk/draft_multikernel_nn 2025-12-04T12:46:35.5153900Z * [new branch] pianpwk/draft_multikernel_status_10_5 -> origin/pianpwk/draft_multikernel_status_10_5 2025-12-04T12:46:35.5154013Z * [new branch] pianpwk/dtensor_custom_chunk -> origin/pianpwk/dtensor_custom_chunk 2025-12-04T12:46:35.5154117Z * [new branch] pianpwk/dtensor_unbacked_keypath -> origin/pianpwk/dtensor_unbacked_keypath 2025-12-04T12:46:35.5154196Z * [new branch] 
pianpwk/event_list_tree -> origin/pianpwk/event_list_tree 2025-12-04T12:46:35.5154277Z * [new branch] pianpwk/false_numel_refs -> origin/pianpwk/false_numel_refs 2025-12-04T12:46:35.5154356Z * [new branch] pianpwk/maybe_guard_rel -> origin/pianpwk/maybe_guard_rel 2025-12-04T12:46:35.5154461Z * [new branch] pianpwk/multikernel_hints_draft -> origin/pianpwk/multikernel_hints_draft 2025-12-04T12:46:35.5154568Z * [new branch] pianpwk/no_size_oblivious_slice_scat -> origin/pianpwk/no_size_oblivious_slice_scat 2025-12-04T12:46:35.5154684Z * [new branch] pianpwk/oblivious_reshape_view_better -> origin/pianpwk/oblivious_reshape_view_better 2025-12-04T12:46:35.5154768Z * [new branch] pianpwk/pre_forward_hook -> origin/pianpwk/pre_forward_hook 2025-12-04T12:46:35.5154873Z * [new branch] pianpwk/skip_python_keys_alternate -> origin/pianpwk/skip_python_keys_alternate 2025-12-04T12:46:35.5154978Z * [new branch] pianpwk/skip_python_keys_in_guards -> origin/pianpwk/skip_python_keys_in_guards 2025-12-04T12:46:35.5155058Z * [new branch] pianpwk/sym_tokens_draft -> origin/pianpwk/sym_tokens_draft 2025-12-04T12:46:35.5155137Z * [new branch] pianpwk/symint_one_hot -> origin/pianpwk/symint_one_hot 2025-12-04T12:46:35.5155253Z * [new branch] pianpwk/test_pointwise_guard_or_false -> origin/pianpwk/test_pointwise_guard_or_false 2025-12-04T12:46:35.5155351Z * [new branch] pianpwk/totally_draft_sym_wrap -> origin/pianpwk/totally_draft_sym_wrap 2025-12-04T12:46:35.5155433Z * [new branch] pianpwk/try_dumb_stuff -> origin/pianpwk/try_dumb_stuff 2025-12-04T12:46:35.5155514Z * [new branch] pianpwk/try_dumb_stuff_2 -> origin/pianpwk/try_dumb_stuff_2 2025-12-04T12:46:35.5155604Z * [new branch] pianpwk/unbacked_dtensor_mm -> origin/pianpwk/unbacked_dtensor_mm 2025-12-04T12:46:35.5155700Z * [new branch] pianpwk/unbacked_tracing_12_2 -> origin/pianpwk/unbacked_tracing_12_2 2025-12-04T12:46:35.5155776Z * [new branch] pianpwk/user_symints -> origin/pianpwk/user_symints 2025-12-04T12:46:35.5155853Z * 
[new branch] pianpwk/wan21_reshape -> origin/pianpwk/wan21_reshape 2025-12-04T12:46:35.5155947Z * [new branch] piz/fix_partial_backward_1112 -> origin/piz/fix_partial_backward_1112 2025-12-04T12:46:35.5156022Z * [new branch] piz/prop_cache_clean -> origin/piz/prop_cache_clean 2025-12-04T12:46:35.5156089Z * [new branch] pool-separate -> origin/pool-separate 2025-12-04T12:46:35.5156152Z * [new branch] pr-156087 -> origin/pr-156087 2025-12-04T12:46:35.5156213Z * [new branch] pr/131860 -> origin/pr/131860 2025-12-04T12:46:35.5156281Z * [new branch] predispatch_to -> origin/predispatch_to 2025-12-04T12:46:35.5156347Z * [new branch] protect-c17 -> origin/protect-c17 2025-12-04T12:46:35.5156412Z * [new branch] pt-opt-cuda3 -> origin/pt-opt-cuda3 2025-12-04T12:46:35.5156492Z * [new branch] python_compiled_autograd -> origin/python_compiled_autograd 2025-12-04T12:46:35.5156657Z * [new branch] q1l1/fix_device_moved_constant_type_unknown -> origin/q1l1/fix_device_moved_constant_type_unknown 2025-12-04T12:46:35.5156795Z * [new branch] q1l1/fix_wrong_default_type_for_kernel_call_args -> origin/q1l1/fix_wrong_default_type_for_kernel_call_args 2025-12-04T12:46:35.5156876Z * [new branch] qchip/export-D54134695 -> origin/qchip/export-D54134695 2025-12-04T12:46:35.5156976Z * [new branch] quote-pytest_cache -> origin/quote-pytest_cache 2025-12-04T12:46:35.5157074Z * [new branch] reland-accgrad-stream-warn -> origin/reland-accgrad-stream-warn 2025-12-04T12:46:35.5157139Z * [new branch] release/1.10 -> origin/release/1.10 2025-12-04T12:46:35.5157202Z * [new branch] release/1.11 -> origin/release/1.11 2025-12-04T12:46:35.5157264Z * [new branch] release/1.12 -> origin/release/1.12 2025-12-04T12:46:35.5157327Z * [new branch] release/1.13 -> origin/release/1.13 2025-12-04T12:46:35.5157390Z * [new branch] release/1.4 -> origin/release/1.4 2025-12-04T12:46:35.5157453Z * [new branch] release/1.4.1 -> origin/release/1.4.1 2025-12-04T12:46:35.5157546Z * [new branch] release/1.5 -> 
origin/release/1.5 2025-12-04T12:46:35.5157611Z * [new branch] release/1.6 -> origin/release/1.6 2025-12-04T12:46:35.5157670Z * [new branch] release/1.7 -> origin/release/1.7 2025-12-04T12:46:35.5157731Z * [new branch] release/1.8 -> origin/release/1.8 2025-12-04T12:46:35.5157791Z * [new branch] release/1.9 -> origin/release/1.9 2025-12-04T12:46:35.5157850Z * [new branch] release/2.0 -> origin/release/2.0 2025-12-04T12:46:35.5157910Z * [new branch] release/2.1 -> origin/release/2.1 2025-12-04T12:46:35.5157971Z * [new branch] release/2.2 -> origin/release/2.2 2025-12-04T12:46:35.5158031Z * [new branch] release/2.3 -> origin/release/2.3 2025-12-04T12:46:35.5158091Z * [new branch] release/2.4 -> origin/release/2.4 2025-12-04T12:46:35.5158149Z * [new branch] release/2.5 -> origin/release/2.5 2025-12-04T12:46:35.5158210Z * [new branch] release/2.6 -> origin/release/2.6 2025-12-04T12:46:35.5163828Z * [new branch] release/2.7 -> origin/release/2.7 2025-12-04T12:46:35.5163908Z * [new branch] release/2.8 -> origin/release/2.8 2025-12-04T12:46:35.5163970Z * [new branch] release/2.9 -> origin/release/2.9 2025-12-04T12:46:35.5164037Z * [new branch] release_notes -> origin/release_notes 2025-12-04T12:46:35.5164116Z * [new branch] remove_pyinterpreter -> origin/remove_pyinterpreter 2025-12-04T12:46:35.5164247Z * [new branch] replace-pytorch-labs-20250812-195836 -> origin/replace-pytorch-labs-20250812-195836 2025-12-04T12:46:35.5164371Z * [new branch] replace-pytorch-labs-20250812-200248 -> origin/replace-pytorch-labs-20250812-200248 2025-12-04T12:46:35.5164489Z * [new branch] replace-pytorch-labs-20250812-200324 -> origin/replace-pytorch-labs-20250812-200324 2025-12-04T12:46:35.5164608Z * [new branch] replace-pytorch-labs-20250812-204020 -> origin/replace-pytorch-labs-20250812-204020 2025-12-04T12:46:35.5164744Z * [new branch] revert-131069-gh/krzysztofjordan/1/head -> origin/revert-131069-gh/krzysztofjordan/1/head 2025-12-04T12:46:35.5164854Z * [new branch] 
revert-131469-gh/andrewor14/51/head -> origin/revert-131469-gh/andrewor14/51/head 2025-12-04T12:46:35.5164959Z * [new branch] revert-152361-gh/fadara01/1/head -> origin/revert-152361-gh/fadara01/1/head 2025-12-04T12:46:35.5165119Z * [new branch] revert-156870-gh/skarjala/3/head -> origin/revert-156870-gh/skarjala/3/head 2025-12-04T12:46:35.5165292Z * [new branch] revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ -> origin/revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ 2025-12-04T12:46:35.5165390Z * [new branch] revert-hoo-invoke-subgraph -> origin/revert-hoo-invoke-subgraph 2025-12-04T12:46:35.5165529Z * [new branch] revert_always_build_distributed -> origin/revert_always_build_distributed 2025-12-04T12:46:35.5165596Z * [new branch] rms_norm_patch -> origin/rms_norm_patch 2025-12-04T12:46:35.5165694Z * [new branch] ruisi/fix_all_to_all_estimation -> origin/ruisi/fix_all_to_all_estimation 2025-12-04T12:46:35.5165777Z * [new branch] ruisi/fix_comm_estimation -> origin/ruisi/fix_comm_estimation 2025-12-04T12:46:35.5165885Z * [new branch] ruisi/fix_dynamic_shape_estimation -> origin/ruisi/fix_dynamic_shape_estimation 2025-12-04T12:46:35.5165987Z * [new branch] ruisi/fix_llama3_autobucketing -> origin/ruisi/fix_llama3_autobucketing 2025-12-04T12:46:35.5166092Z * [new branch] ruisi/fix_manual_bucketing_ep_pass -> origin/ruisi/fix_manual_bucketing_ep_pass 2025-12-04T12:46:35.5166175Z * [new branch] ruisi/manual_bucket_pass -> origin/ruisi/manual_bucket_pass 2025-12-04T12:46:35.5166326Z * [new branch] ryanguo99/cleanup-dynamo-expected-failures -> origin/ryanguo99/cleanup-dynamo-expected-failures 2025-12-04T12:46:35.5166413Z * [new branch] ryanguo99/fix-closure-var -> origin/ryanguo99/fix-closure-var 2025-12-04T12:46:35.5166490Z * [new branch] rzou/faketensor_bench -> origin/rzou/faketensor_bench 2025-12-04T12:46:35.5166553Z * [new branch] rzou/njt -> origin/rzou/njt 2025-12-04T12:46:35.5166614Z * [new branch] rzou/pca -> origin/rzou/pca 
2025-12-04T12:46:35.5166681Z * [new branch] rzou/realprop -> origin/rzou/realprop 2025-12-04T12:46:35.5166747Z * [new branch] samplevllm -> origin/samplevllm 2025-12-04T12:46:35.5166914Z * [new branch] sanchitintel/weird_thing_with_test_cpu_select_algorithm -> origin/sanchitintel/weird_thing_with_test_cpu_select_algorithm 2025-12-04T12:46:35.5167012Z * [new branch] sapling-pr-archive-SS-JIA -> origin/sapling-pr-archive-SS-JIA 2025-12-04T12:46:35.5167125Z * [new branch] sapling-pr-archive-tushar00jain -> origin/sapling-pr-archive-tushar00jain 2025-12-04T12:46:35.5167185Z * [new branch] save -> origin/save 2025-12-04T12:46:35.5167248Z * [new branch] scaled_mm -> origin/scaled_mm 2025-12-04T12:46:35.5167314Z * [new branch] scan_attempt -> origin/scan_attempt 2025-12-04T12:46:35.5167376Z * [new branch] sdym/2.5.1 -> origin/sdym/2.5.1 2025-12-04T12:46:35.5167518Z * [new branch] sekyondaMeta-dynamoconfig-fix -> origin/sekyondaMeta-dynamoconfig-fix 2025-12-04T12:46:35.5167597Z * [new branch] shengf/fx-xform-perf -> origin/shengf/fx-xform-perf 2025-12-04T12:46:35.5167675Z * [new branch] shoumikhin-patch-1 -> origin/shoumikhin-patch-1 2025-12-04T12:46:35.5167753Z * [new branch] solve-accuracy-fix -> origin/solve-accuracy-fix 2025-12-04T12:46:35.5167833Z * [new branch] some_rocm_inductor_skips -> origin/some_rocm_inductor_skips 2025-12-04T12:46:35.5167914Z * [new branch] soulitzer/stash-tls-ac -> origin/soulitzer/stash-tls-ac 2025-12-04T12:46:35.5167997Z * [new branch] sparse-mm-bf16-support -> origin/sparse-mm-bf16-support 2025-12-04T12:46:35.5168070Z * [new branch] starterTaskUpdate -> origin/starterTaskUpdate 2025-12-04T12:46:35.5168169Z * [new branch] suo -> origin/suo 2025-12-04T12:46:35.5168234Z * [new branch] sve-poc -> origin/sve-poc 2025-12-04T12:46:35.5168295Z * [new branch] switch-bn -> origin/switch-bn 2025-12-04T12:46:35.5168387Z * [new branch] sy_annotation_in_autograd_hop -> origin/sy_annotation_in_autograd_hop 2025-12-04T12:46:35.5168493Z * [new branch] 
sy_aot_eager_record -> origin/sy_aot_eager_record 2025-12-04T12:46:35.5168562Z * [new branch] sy_custom_bucketing -> origin/sy_custom_bucketing 2025-12-04T12:46:35.5168630Z * [new branch] sy_debug_mode_test -> origin/sy_debug_mode_test 2025-12-04T12:46:35.5168695Z * [new branch] sy_deserialize -> origin/sy_deserialize 2025-12-04T12:46:35.5168761Z * [new branch] sy_dump_gm_code -> origin/sy_dump_gm_code 2025-12-04T12:46:35.5168822Z * [new branch] sy_exp -> origin/sy_exp 2025-12-04T12:46:35.5168894Z * [new branch] sy_export_annotation -> origin/sy_export_annotation 2025-12-04T12:46:35.5168961Z * [new branch] sy_invoke_subgraph -> origin/sy_invoke_subgraph 2025-12-04T12:46:35.5169030Z * [new branch] sy_kernel_bw_name -> origin/sy_kernel_bw_name 2025-12-04T12:46:35.5169093Z * [new branch] sy_multi_arch -> origin/sy_multi_arch 2025-12-04T12:46:35.5169160Z * [new branch] sy_nn_module_stack -> origin/sy_nn_module_stack 2025-12-04T12:46:35.5169230Z * [new branch] sy_original_dtensor -> origin/sy_original_dtensor 2025-12-04T12:46:35.5169295Z * [new branch] sy_profiler_cia -> origin/sy_profiler_cia 2025-12-04T12:46:35.5169358Z * [new branch] symm_mem_sync -> origin/symm_mem_sync 2025-12-04T12:46:35.5169442Z * [new branch] sympy-bottleneck-repro -> origin/sympy-bottleneck-repro 2025-12-04T12:46:35.5169521Z * [new branch] tensordict_integration -> origin/tensordict_integration 2025-12-04T12:46:35.5169601Z * [new branch] test-move-conda-builds -> origin/test-move-conda-builds 2025-12-04T12:46:35.5169663Z * [new branch] test-old -> origin/test-old 2025-12-04T12:46:35.5169730Z * [new branch] test/bmm_heur -> origin/test/bmm_heur 2025-12-04T12:46:35.5169827Z * [new branch] tianren/customOp_autotune_fix -> origin/tianren/customOp_autotune_fix 2025-12-04T12:46:35.5169939Z * [new branch] tianren/customOp_enable_max_autotune -> origin/tianren/customOp_enable_max_autotune 2025-12-04T12:46:35.5170018Z * [new branch] tianren/customOp_fusion -> origin/tianren/customOp_fusion 
2025-12-04T12:46:35.5170143Z * [new branch] tianren/customop_collectiveop_benchmark -> origin/tianren/customop_collectiveop_benchmark 2025-12-04T12:46:35.5170279Z * [new branch] tianren/customop_collectiveop_benchmark_fix -> origin/tianren/customop_collectiveop_benchmark_fix 2025-12-04T12:46:35.5170380Z * [new branch] tianren/customop_dynamic_config -> origin/tianren/customop_dynamic_config 2025-12-04T12:46:35.5170473Z * [new branch] tianren/dynamic_range_input -> origin/tianren/dynamic_range_input 2025-12-04T12:46:35.5170574Z * [new branch] tianren/dynamic_range_input_fix -> origin/tianren/dynamic_range_input_fix 2025-12-04T12:46:35.5170678Z * [new branch] tianren/dynamic_range_input_merge -> origin/tianren/dynamic_range_input_merge 2025-12-04T12:46:35.5170778Z * [new branch] tianren/flex_paged_attn_fix_temp -> origin/tianren/flex_paged_attn_fix_temp 2025-12-04T12:46:35.5170856Z * [new branch] tianren/fx_codegen_dump -> origin/tianren/fx_codegen_dump 2025-12-04T12:46:35.5170939Z * [new branch] tianren/symmetric_memory -> origin/tianren/symmetric_memory 2025-12-04T12:46:35.5171031Z * [new branch] tianren/test -> origin/tianren/test 2025-12-04T12:46:35.5171108Z * [new branch] tidy_performance_cyy -> origin/tidy_performance_cyy 2025-12-04T12:46:35.5171166Z * [new branch] tmp -> origin/tmp 2025-12-04T12:46:35.5171265Z * [new branch] torchtitan_ep -> origin/torchtitan_ep 2025-12-04T12:46:35.5171341Z * [new branch] torchtitan_integration -> origin/torchtitan_integration 2025-12-04T12:46:35.5171423Z * [new branch] trace_fsdp_torchtune_lora -> origin/trace_fsdp_torchtune_lora 2025-12-04T12:46:35.5171508Z * [new branch] traceable_fsdp_unit_tests -> origin/traceable_fsdp_unit_tests 2025-12-04T12:46:35.5171577Z * [new branch] tree_loop_vec_base -> origin/tree_loop_vec_base 2025-12-04T12:46:35.5171641Z * [new branch] triton_kernel -> origin/triton_kernel 2025-12-04T12:46:35.5171703Z * [new branch] tt_pkg_1908 -> origin/tt_pkg_1908 2025-12-04T12:46:35.5171764Z * [new branch] 
type_dec -> origin/type_dec 2025-12-04T12:46:35.5171858Z * [new branch] udate-sphinx-dependancies -> origin/udate-sphinx-dependancies 2025-12-04T12:46:35.5171998Z * [new branch] update-audio-commit-hash/17630256502-1803-1 -> origin/update-audio-commit-hash/17630256502-1803-1 2025-12-04T12:46:35.5172130Z * [new branch] update-audio-commit-hash/19087141161-1916-1 -> origin/update-audio-commit-hash/19087141161-1916-1 2025-12-04T12:46:35.5172259Z * [new branch] update-audio-commit-hash/19250643381-1929-1 -> origin/update-audio-commit-hash/19250643381-1929-1 2025-12-04T12:46:35.5172388Z * [new branch] update-audio-commit-hash/19397724337-1935-1 -> origin/update-audio-commit-hash/19397724337-1935-1 2025-12-04T12:46:35.5172517Z * [new branch] update-audio-commit-hash/19555670148-1941-1 -> origin/update-audio-commit-hash/19555670148-1941-1 2025-12-04T12:46:35.5172648Z * [new branch] update-audio-commit-hash/19750627930-1946-1 -> origin/update-audio-commit-hash/19750627930-1946-1 2025-12-04T12:46:35.5172782Z * [new branch] update-triton-commit-hash/13663274526-1487-2 -> origin/update-triton-commit-hash/13663274526-1487-2 2025-12-04T12:46:35.5172916Z * [new branch] update-vision-commit-hash/19087141161-1916-1 -> origin/update-vision-commit-hash/19087141161-1916-1 2025-12-04T12:46:35.5173048Z * [new branch] update-vision-commit-hash/19184897099-1925-1 -> origin/update-vision-commit-hash/19184897099-1925-1 2025-12-04T12:46:35.5173179Z * [new branch] update-vision-commit-hash/19250643381-1929-1 -> origin/update-vision-commit-hash/19250643381-1929-1 2025-12-04T12:46:35.5173310Z * [new branch] update-vision-commit-hash/19381328640-1934-1 -> origin/update-vision-commit-hash/19381328640-1934-1 2025-12-04T12:46:35.5173442Z * [new branch] update-vision-commit-hash/19485237164-1938-1 -> origin/update-vision-commit-hash/19485237164-1938-1 2025-12-04T12:46:35.5173571Z * [new branch] update-vllm-commit-hash/18451675449-1879-1 -> origin/update-vllm-commit-hash/18451675449-1879-1 
2025-12-04T12:46:35.5173656Z * [new branch] update-vllm-dockerfile -> origin/update-vllm-dockerfile 2025-12-04T12:46:35.5173779Z * [new branch] update-xla-commit-hash/19224287370-211-1 -> origin/update-xla-commit-hash/19224287370-211-1 2025-12-04T12:46:35.5173901Z * [new branch] update-xla-commit-hash/19422028566-212-1 -> origin/update-xla-commit-hash/19422028566-212-1 2025-12-04T12:46:35.5174022Z * [new branch] update-xla-commit-hash/19626841311-213-1 -> origin/update-xla-commit-hash/19626841311-213-1 2025-12-04T12:46:35.5174172Z * [new branch] update_docs_torch_multinomial_issue#125388 -> origin/update_docs_torch_multinomial_issue#125388 2025-12-04T12:46:35.5174252Z * [new branch] update_operator_readme -> origin/update_operator_readme 2025-12-04T12:46:35.5174343Z * [new branch] update_slow_tests_1722488736 -> origin/update_slow_tests_1722488736 2025-12-04T12:46:35.5174430Z * [new branch] update_slow_tests_1722879173 -> origin/update_slow_tests_1722879173 2025-12-04T12:46:35.5174542Z * [new branch] update_slow_tests_1762155677 -> origin/update_slow_tests_1762155677 2025-12-04T12:46:35.5174627Z * [new branch] update_slow_tests_1763365283 -> origin/update_slow_tests_1763365283 2025-12-04T12:46:35.5174711Z * [new branch] update_submodule_FBGEMM -> origin/update_submodule_FBGEMM 2025-12-04T12:46:35.5174789Z * [new branch] update_submodule_kineto -> origin/update_submodule_kineto 2025-12-04T12:46:35.5174879Z * [new branch] update_submodule_tensorpipe -> origin/update_submodule_tensorpipe 2025-12-04T12:46:35.5174979Z * [new branch] upload-tests-for-autorevert -> origin/upload-tests-for-autorevert 2025-12-04T12:46:35.5175043Z * [new branch] v0.1.2 -> origin/v0.1.2 2025-12-04T12:46:35.5175104Z * [new branch] v1.0.1 -> origin/v1.0.1 2025-12-04T12:46:35.5175164Z * [new branch] v1.0.3 -> origin/v1.0.3 2025-12-04T12:46:35.5175223Z * [new branch] v1.1.0 -> origin/v1.1.0 2025-12-04T12:46:35.5175279Z * [new branch] v1.2.0 -> origin/v1.2.0 2025-12-04T12:46:35.5175335Z * [new 
branch] v1.3.0 -> origin/v1.3.0 2025-12-04T12:46:35.5175392Z * [new branch] v1.3.1 -> origin/v1.3.1 2025-12-04T12:46:35.5175455Z * [new branch] validate_fn -> origin/validate_fn 2025-12-04T12:46:35.5175525Z * [new branch] validations_2.6 -> origin/validations_2.6 2025-12-04T12:46:35.5175592Z * [new branch] validations_2.8 -> origin/validations_2.8 2025-12-04T12:46:35.5175657Z * [new branch] varlen-api -> origin/varlen-api 2025-12-04T12:46:35.5175732Z * [new branch] varlen-api-backup -> origin/varlen-api-backup 2025-12-04T12:46:35.5175812Z * [new branch] varlen_batch_invariance -> origin/varlen_batch_invariance 2025-12-04T12:46:35.5175878Z * [new branch] viable/strict -> origin/viable/strict 2025-12-04T12:46:35.5175994Z * [new branch] vishal9-team/dtensor_parallelism_toy -> origin/vishal9-team/dtensor_parallelism_toy 2025-12-04T12:46:35.5176059Z * [new branch] vllmbuildci -> origin/vllmbuildci 2025-12-04T12:46:35.5176120Z * [new branch] vllmpin -> origin/vllmpin 2025-12-04T12:46:35.5176211Z * [new branch] vscode-recommend-pyrefly -> origin/vscode-recommend-pyrefly 2025-12-04T12:46:35.5176279Z * [new branch] wdvr-patch-1 -> origin/wdvr-patch-1 2025-12-04T12:46:35.5176343Z * [new branch] wdvr/iss_145259 -> origin/wdvr/iss_145259 2025-12-04T12:46:35.5176406Z * [new branch] whc/pei -> origin/whc/pei 2025-12-04T12:46:35.5176472Z * [new branch] whc/pp_fix -> origin/whc/pp_fix 2025-12-04T12:46:35.5176534Z * [new branch] whc/sharding -> origin/whc/sharding 2025-12-04T12:46:35.5176602Z * [new branch] whc/sharding2 -> origin/whc/sharding2 2025-12-04T12:46:35.5176664Z * [new branch] whc/uneven -> origin/whc/uneven 2025-12-04T12:46:35.5176734Z * [new branch] whc/uneven-merge -> origin/whc/uneven-merge 2025-12-04T12:46:35.5176797Z * [new branch] win_warnings -> origin/win_warnings 2025-12-04T12:46:35.5176896Z * [new branch] windows_libtorch_free -> origin/windows_libtorch_free 2025-12-04T12:46:35.5176958Z * [new branch] xmfan-war -> origin/xmfan-war 2025-12-04T12:46:35.5177022Z 
* [new branch] xmfan/ca_0516 -> origin/xmfan/ca_0516 2025-12-04T12:46:35.5177090Z * [new branch] xmfan/ca_1051b93192 -> origin/xmfan/ca_1051b93192 2025-12-04T12:46:35.5177263Z * [new branch] xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 -> origin/xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 2025-12-04T12:46:35.5177335Z * [new branch] xmfan/ca_5a2be192d1 -> origin/xmfan/ca_5a2be192d1 2025-12-04T12:46:35.5177403Z * [new branch] xmfan/ca_9d59b516e9 -> origin/xmfan/ca_9d59b516e9 2025-12-04T12:46:35.5177469Z * [new branch] xmfan/ca_apr8 -> origin/xmfan/ca_apr8 2025-12-04T12:46:35.5177572Z * [new branch] xmfan/ca_base -> origin/xmfan/ca_base 2025-12-04T12:46:35.5177641Z * [new branch] xmfan/ca_dynamic -> origin/xmfan/ca_dynamic 2025-12-04T12:46:35.5177706Z * [new branch] xmfan/ca_fix_dyn -> origin/xmfan/ca_fix_dyn 2025-12-04T12:46:35.5177781Z * [new branch] xmfan/ca_fix_lowering -> origin/xmfan/ca_fix_lowering 2025-12-04T12:46:35.5177858Z * [new branch] xmfan/ca_fix_polyfills -> origin/xmfan/ca_fix_polyfills 2025-12-04T12:46:35.5177921Z * [new branch] xmfan/ca_jan3 -> origin/xmfan/ca_jan3 2025-12-04T12:46:35.5177985Z * [new branch] xmfan/ca_jun18 -> origin/xmfan/ca_jun18 2025-12-04T12:46:35.5178049Z * [new branch] xmfan/ca_jun24 -> origin/xmfan/ca_jun24 2025-12-04T12:46:35.5178116Z * [new branch] xmfan/ca_nested -> origin/xmfan/ca_nested 2025-12-04T12:46:35.5178184Z * [new branch] xmfan/ca_overhead -> origin/xmfan/ca_overhead 2025-12-04T12:46:35.5178276Z * [new branch] xmfan/ca_overhead_0eba7e5451 -> origin/xmfan/ca_overhead_0eba7e5451 2025-12-04T12:46:35.5178344Z * [new branch] xmfan/cacu_jun18 -> origin/xmfan/cacu_jun18 2025-12-04T12:46:35.5178412Z * [new branch] xmfan/cacu_jun19 -> origin/xmfan/cacu_jun19 2025-12-04T12:46:35.5178479Z * [new branch] xmfan/cacu_jun4 -> origin/xmfan/cacu_jun4 2025-12-04T12:46:35.5178563Z * [new branch] xmfan/disable_duck_shape -> origin/xmfan/disable_duck_shape 2025-12-04T12:46:35.5178661Z * [new branch] 
xmfan/fca_cpp_node_passthrough -> origin/xmfan/fca_cpp_node_passthrough 2025-12-04T12:46:35.5178812Z * [new branch] xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 -> origin/xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T12:46:35.5178958Z * [new branch] xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 -> origin/xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T12:46:35.5179030Z * [new branch] xmfan/single_step -> origin/xmfan/single_step 2025-12-04T12:46:35.5179094Z * [new branch] xmfan/sth_0829 -> origin/xmfan/sth_0829 2025-12-04T12:46:35.5179157Z * [new branch] xmfan/test -> origin/xmfan/test 2025-12-04T12:46:35.5179245Z * [new branch] yguo/debug-0226-constexpr -> origin/yguo/debug-0226-constexpr 2025-12-04T12:46:35.5179322Z * [new branch] yguo/new_latest_changes -> origin/yguo/new_latest_changes 2025-12-04T12:46:35.5179417Z * [new branch] yguo/patch_constexpr_changes -> origin/yguo/patch_constexpr_changes 2025-12-04T12:46:35.5179484Z * [new branch] yiming/bootcamp -> origin/yiming/bootcamp 2025-12-04T12:46:35.5179586Z * [new branch] yiming/run_with_start_end_rng_hop -> origin/yiming/run_with_start_end_rng_hop 2025-12-04T12:46:35.5179699Z * [new branch] yolo-llama3 -> origin/yolo-llama3 2025-12-04T12:46:35.5179770Z * [new branch] zainr/canary-test -> origin/zainr/canary-test 2025-12-04T12:46:35.5179858Z * [new branch] zainr/cleanup-gh-runners -> origin/zainr/cleanup-gh-runners 2025-12-04T12:46:35.5179937Z * [new branch] zainr/pull-migration-c -> origin/zainr/pull-migration-c 2025-12-04T12:46:35.5180036Z * [new branch] zainr/test2 -> origin/zainr/test2 2025-12-04T12:46:35.5180111Z * [new branch] zasdfgbnm-patch-3 -> origin/zasdfgbnm-patch-3 2025-12-04T12:46:35.5180170Z * [new branch] zb2p -> origin/zb2p 2025-12-04T12:46:35.5180254Z * [new branch] zeros-and-scatter-part2 -> origin/zeros-and-scatter-part2 2025-12-04T12:46:35.5180342Z * [new branch] zhxchen17/ci/vllm_lora_oom -> origin/zhxchen17/ci/vllm_lora_oom 
2025-12-04T12:46:35.5180445Z * [new branch] zhxchen17/ci/vllm_multimodal_oom -> origin/zhxchen17/ci/vllm_multimodal_oom 2025-12-04T12:46:35.5180520Z * [new branch] zhxchen17/ci/vllm_pin -> origin/zhxchen17/ci/vllm_pin 2025-12-04T12:46:35.5180645Z * [new branch] zhxchen17/dynamo/unsafe_drop_all_guards -> origin/zhxchen17/dynamo/unsafe_drop_all_guards 2025-12-04T12:46:35.5180745Z * [new branch] zhxchen17/export/call_override -> origin/zhxchen17/export/call_override 2025-12-04T12:46:35.5180830Z * [new branch] zhxchen17/export/codemod1 -> origin/zhxchen17/export/codemod1 2025-12-04T12:46:35.5180921Z * [new branch] zhxchen17/export/ctx_return -> origin/zhxchen17/export/ctx_return 2025-12-04T12:46:35.5181050Z * [new branch] zhxchen17/export/disable_side_effect_warn -> origin/zhxchen17/export/disable_side_effect_warn 2025-12-04T12:46:35.5181148Z * [new branch] zhxchen17/export/pytree_check -> origin/zhxchen17/export/pytree_check 2025-12-04T12:46:35.5181236Z * [new branch] zhxchen17/precompile/aoti -> origin/zhxchen17/precompile/aoti 2025-12-04T12:46:35.5181333Z * [new branch] zhxchen17/precompile/globals -> origin/zhxchen17/precompile/globals 2025-12-04T12:46:35.5181450Z * [new branch] zhxchen17/precompile/inductor_guards -> origin/zhxchen17/precompile/inductor_guards 2025-12-04T12:46:35.5181525Z * [new branch] zhxchen17/scratch/0 -> origin/zhxchen17/scratch/0 2025-12-04T12:46:35.5181631Z * [new branch] zhxchen17/torch_export_api_update -> origin/zhxchen17/torch_export_api_update 2025-12-04T12:46:35.5181707Z * [new branch] zhxhcen17/moodycamel -> origin/zhxhcen17/moodycamel 2025-12-04T12:46:35.5181782Z * [new branch] zxiiro/build-times -> origin/zxiiro/build-times 2025-12-04T12:46:35.5181853Z * [new branch] zxiiro/c7i.2xlarge -> origin/zxiiro/c7i.2xlarge 2025-12-04T12:46:35.5181934Z * [new branch] zxiiro/c7i.2xlarge.h100 -> origin/zxiiro/c7i.2xlarge.h100 2025-12-04T12:46:35.5181996Z * [new branch] zxiiro/main -> origin/zxiiro/main 2025-12-04T12:46:35.5182060Z * [new 
branch] zxiiro/risc64 -> origin/zxiiro/risc64 2025-12-04T12:46:35.5182153Z * [new branch] zxiiro/test-multicloud-arc -> origin/zxiiro/test-multicloud-arc 2025-12-04T12:46:35.5182223Z t [tag update] ciflow/inductor/169437 -> ciflow/inductor/169437 2025-12-04T12:46:35.5182286Z * [new tag] ciflow/inductor/169564 -> ciflow/inductor/169564 2025-12-04T12:46:35.5182344Z * [new tag] ciflow/rocm/169564 -> ciflow/rocm/169564 2025-12-04T12:46:35.5182408Z t [tag update] ciflow/trunk/169385 -> ciflow/trunk/169385 2025-12-04T12:46:35.5182474Z t [tag update] ciflow/trunk/169437 -> ciflow/trunk/169437 2025-12-04T12:46:35.5182644Z * [new tag] trunk/a2b5dfb956aed182f6aefce1ff2eda70c35049e1 -> trunk/a2b5dfb956aed182f6aefce1ff2eda70c35049e1 2025-12-04T12:46:35.7119851Z [command]/usr/bin/git rev-parse --verify --quiet ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32^{object} 2025-12-04T12:46:35.7315618Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:46:35.7322147Z ##[endgroup] 2025-12-04T12:46:35.7322628Z ##[group]Determining the checkout info 2025-12-04T12:46:35.7324044Z ##[endgroup] 2025-12-04T12:46:35.7330076Z [command]/usr/bin/git sparse-checkout disable 2025-12-04T12:46:35.7422559Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2025-12-04T12:46:35.7445595Z ##[group]Checking out the ref 2025-12-04T12:46:35.7447060Z [command]/usr/bin/git checkout --progress --force ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:46:35.8346139Z Previous HEAD position was 685ba6bc0117 add back legalize_graph for BC reason (#169541) 2025-12-04T12:46:35.8351309Z HEAD is now at ffd9b0fb4355 Resolve collective autotuning test failure on arm (#168919) 2025-12-04T12:46:35.8455958Z ##[endgroup] 2025-12-04T12:46:35.8456285Z ##[group]Setting up auth for fetching submodules 2025-12-04T12:46:35.8461101Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T12:46:35.8500248Z [command]/usr/bin/git config --global 
--unset-all url.https://github.com/.insteadOf 2025-12-04T12:46:35.8526691Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-12-04T12:46:35.8544069Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-12-04T12:46:35.8572641Z ##[endgroup] 2025-12-04T12:46:35.8573068Z ##[group]Fetching submodules 2025-12-04T12:46:35.8573409Z [command]/usr/bin/git submodule sync --recursive 2025-12-04T12:46:35.8775358Z Synchronizing submodule url for 'android/libs/fbjni' 2025-12-04T12:46:35.8785268Z Synchronizing submodule url for 'third_party/FP16' 2025-12-04T12:46:35.8796560Z Synchronizing submodule url for 'third_party/FXdiv' 2025-12-04T12:46:35.8813934Z Synchronizing submodule url for 'third_party/NNPACK' 2025-12-04T12:46:35.8828008Z Synchronizing submodule url for 'third_party/NVTX' 2025-12-04T12:46:35.8839716Z Synchronizing submodule url for 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:35.8854154Z Synchronizing submodule url for 'third_party/XNNPACK' 2025-12-04T12:46:35.8875327Z Synchronizing submodule url for 'third_party/aiter' 2025-12-04T12:46:35.8888044Z Synchronizing submodule url for 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:35.8910661Z Synchronizing submodule url for 'third_party/benchmark' 2025-12-04T12:46:35.8921598Z Synchronizing submodule url for 'third_party/composable_kernel' 2025-12-04T12:46:35.8937447Z Synchronizing submodule url for 'third_party/cpp-httplib' 2025-12-04T12:46:35.8948960Z Synchronizing submodule url for 'third_party/cpuinfo' 2025-12-04T12:46:35.8958620Z Synchronizing submodule url for 'third_party/cudnn_frontend' 2025-12-04T12:46:35.8969912Z Synchronizing submodule url for 'third_party/cutlass' 2025-12-04T12:46:35.8984735Z Synchronizing submodule url for 'third_party/fbgemm' 2025-12-04T12:46:35.8997008Z Synchronizing submodule url for 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:35.9006527Z Synchronizing 
submodule url for 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:35.9029635Z Synchronizing submodule url for 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:35.9043563Z Synchronizing submodule url for 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:35.9056756Z Synchronizing submodule url for 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:35.9074922Z Synchronizing submodule url for 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:35.9084597Z Synchronizing submodule url for 'third_party/fbgemm/external/json' 2025-12-04T12:46:35.9102750Z Synchronizing submodule url for 'third_party/flash-attention' 2025-12-04T12:46:35.9116219Z Synchronizing submodule url for 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:35.9133518Z Synchronizing submodule url for 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:35.9149413Z Synchronizing submodule url for 'third_party/flatbuffers' 2025-12-04T12:46:35.9162475Z Synchronizing submodule url for 'third_party/fmt' 2025-12-04T12:46:35.9173785Z Synchronizing submodule url for 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:35.9187805Z Synchronizing submodule url for 'third_party/gloo' 2025-12-04T12:46:35.9198919Z Synchronizing submodule url for 'third_party/googletest' 2025-12-04T12:46:35.9208946Z Synchronizing submodule url for 'third_party/ideep' 2025-12-04T12:46:35.9219335Z Synchronizing submodule url for 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:35.9234049Z Synchronizing submodule url for 'third_party/ittapi' 2025-12-04T12:46:35.9244341Z Synchronizing submodule url for 'third_party/kineto' 2025-12-04T12:46:35.9260741Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:35.9277961Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:35.9290800Z Synchronizing submodule url for 
'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:35.9302790Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:35.9313679Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:35.9324900Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:35.9339934Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:35.9349387Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:35.9358719Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:35.9368191Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:35.9377020Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:35.9386666Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:35.9398810Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:35.9413045Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:35.9427636Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:35.9440500Z Synchronizing submodule url for 'third_party/kleidiai' 2025-12-04T12:46:35.9451218Z Synchronizing submodule url for 'third_party/mimalloc' 2025-12-04T12:46:35.9462626Z Synchronizing submodule url for 'third_party/nlohmann' 2025-12-04T12:46:35.9474638Z Synchronizing submodule url for 'third_party/onnx' 
2025-12-04T12:46:35.9494158Z Synchronizing submodule url for 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:35.9509066Z Synchronizing submodule url for 'third_party/opentelemetry-cpp' 2025-12-04T12:46:35.9521454Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:35.9531487Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:35.9544536Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:35.9555122Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:35.9566452Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:35.9576929Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:35.9587300Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:35.9597573Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:35.9609095Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:35.9621104Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:35.9640382Z Synchronizing submodule url for 'third_party/pocketfft' 2025-12-04T12:46:35.9658061Z Synchronizing submodule url for 'third_party/protobuf' 2025-12-04T12:46:35.9670970Z Synchronizing submodule url for 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:35.9682916Z Synchronizing submodule url for 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:35.9695505Z Synchronizing submodule url for 'third_party/psimd' 2025-12-04T12:46:35.9706648Z Synchronizing submodule url for 'third_party/pthreadpool' 2025-12-04T12:46:35.9717275Z Synchronizing 
submodule url for 'third_party/pybind11' 2025-12-04T12:46:35.9733092Z Synchronizing submodule url for 'third_party/python-peachpy' 2025-12-04T12:46:35.9743568Z Synchronizing submodule url for 'third_party/sleef' 2025-12-04T12:46:35.9755632Z Synchronizing submodule url for 'third_party/tensorpipe' 2025-12-04T12:46:35.9766691Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:35.9779081Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:35.9793498Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:35.9804480Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:35.9819185Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:35.9855334Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --recursive 2025-12-04T12:46:36.0118863Z Submodule path 'android/libs/fbjni': checked out '7e1e1fe3858c63c251c637ae41a20de425dde96f' 2025-12-04T12:46:36.0205103Z Submodule path 'third_party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3' 2025-12-04T12:46:36.0271619Z Submodule path 'third_party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1' 2025-12-04T12:46:36.0384222Z Submodule path 'third_party/NNPACK': checked out 'c07e3a0400713d546e0dea2d5466dd22ea389c73' 2025-12-04T12:46:36.0455615Z Submodule path 'third_party/NVTX': checked out '3ebbc93ded7285963bff932c678fa367eb393ba6' 2025-12-04T12:46:36.0519408Z Submodule path 'third_party/VulkanMemoryAllocator': checked out '1d8f600fd424278486eade7ed3e877c99f0846b1' 2025-12-04T12:46:36.5292025Z Submodule path 'third_party/XNNPACK': checked out '51a0103656eff6fc9bfd39a4597923c4b542c883' 2025-12-04T12:46:36.5493451Z Submodule path 'third_party/aiter': checked out '01aae101b9e5e94d6c16a9514c9fb8df99c93150' 2025-12-04T12:46:36.5691183Z Submodule path 
'third_party/aiter/3rdparty/composable_kernel': checked out 'cffe8fa2a442ac8e80dd236a1a5d24fe3d7e0cbf' 2025-12-04T12:46:36.5819467Z Submodule path 'third_party/benchmark': checked out '299e5928955cc62af9968370293b916f5130916f' 2025-12-04T12:46:36.6011771Z Submodule path 'third_party/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T12:46:36.6087126Z Submodule path 'third_party/cpp-httplib': checked out '89c932f313c6437c38f2982869beacc89c2f2246' 2025-12-04T12:46:36.6712496Z Submodule path 'third_party/cpuinfo': checked out 'f858c30bcb16f8effd5ff46996f0514539e17abc' 2025-12-04T12:46:36.6792589Z Submodule path 'third_party/cudnn_frontend': checked out '0b1577c8c83401237d601d0d0db5210506705396' 2025-12-04T12:46:36.6912478Z Submodule path 'third_party/cutlass': checked out 'f88806b1e31dfa579842638740216dd41fc6c588' 2025-12-04T12:46:36.7650515Z Submodule path 'third_party/fbgemm': checked out 'c0b988d39a9e47c794d699f29930ed4d7c7e13a4' 2025-12-04T12:46:36.7970100Z Submodule path 'third_party/fbgemm/external/asmjit': checked out 'a3199e8857792cd10b7589ff5d58343d2c9008ea' 2025-12-04T12:46:36.9722163Z Submodule path 'third_party/fbgemm/external/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T12:46:37.0370457Z Submodule path 'third_party/fbgemm/external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-12-04T12:46:37.4014080Z Submodule path 'third_party/fbgemm/external/cutlass': checked out '98125ce499b0fdf7ffbe0e3052f5b8709f4840f8' 2025-12-04T12:46:37.4239545Z Submodule path 'third_party/fbgemm/external/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T12:46:37.4338467Z Submodule path 'third_party/fbgemm/external/hipify_torch': checked out '63b6a7b541fa7f08f8475ca7d74054db36ff2691' 2025-12-04T12:46:37.4886967Z Submodule path 'third_party/fbgemm/external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-12-04T12:46:37.4991264Z Submodule 
path 'third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2025-12-04T12:46:37.5197638Z Submodule path 'third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2025-12-04T12:46:37.5339517Z Submodule path 'third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2025-12-04T12:46:37.5458706Z Submodule path 'third_party/flatbuffers': checked out 'a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757' 2025-12-04T12:46:37.5626690Z Submodule path 'third_party/fmt': checked out '407c905e45ad75fc29bf0f9bb7c5c2fd3475976f' 2025-12-04T12:46:37.5827526Z Submodule path 'third_party/gemmlowp/gemmlowp': checked out '3fb5c176c17c765a3492cd2f0321b0dab712f350' 2025-12-04T12:46:37.5950465Z Submodule path 'third_party/gloo': checked out '54cbae0d3a67fa890b4c3d9ee162b7860315e341' 2025-12-04T12:46:37.6147980Z Submodule path 'third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T12:46:37.6217036Z Submodule path 'third_party/ideep': checked out '719d8e6cd7f7a0e01b155657526d693acf97c2b3' 2025-12-04T12:46:38.0329591Z Submodule path 'third_party/ideep/mkl-dnn': checked out '8d263e693366ef8db40acc569cc7d8edf644556d' 2025-12-04T12:46:38.0410744Z Submodule path 'third_party/ittapi': checked out 'dec1d23ca65ab069d225dfe40dea14f455170959' 2025-12-04T12:46:38.0496216Z Submodule path 'third_party/kineto': checked out '31f85df8fbd89c188f14ef10f1ec65379786b943' 2025-12-04T12:46:38.0590825Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' 2025-12-04T12:46:38.0687854Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM': checked out 'ffde4e54bc7249a6039a5e6b45b395141e1217f9' 2025-12-04T12:46:38.0757734Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr': checked out '871ed52d350214a034f6ef8a3b8f51c5ce1bd400' 
2025-12-04T12:46:38.0818024Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt': checked out 'cd4af11efc9c622896a3e4cb599fa28668ca3d05' 2025-12-04T12:46:38.0878735Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags': checked out 'e171aa2d15ed9eb17054558e0b3a6a413bb01067' 2025-12-04T12:46:38.0942793Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc': checked out '8411df715cf522606e3b1aca386ddfc0b63d34b4' 2025-12-04T12:46:38.1022743Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog': checked out 'b33e3bad4c46c8a6345525fd822af355e5ef9446' 2025-12-04T12:46:38.1096746Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T12:46:38.1188455Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json': checked out '4f8fba14066156b73f1189a2b8bd568bde5284c5' 2025-12-04T12:46:38.1245063Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs': checked out 'f68a2fa8ea36c783bdd760371411fcb495aa3150' 2025-12-04T12:46:38.1324367Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' 2025-12-04T12:46:38.1420498Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159' 2025-12-04T12:46:38.1481161Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T12:46:38.1541011Z Submodule path 'third_party/kineto/libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' 2025-12-04T12:46:38.1605794Z Submodule path 
'third_party/kineto/libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T12:46:38.1690752Z Submodule path 'third_party/kleidiai': checked out 'd7770c89632329a9914ef1a90289917597639cbe' 2025-12-04T12:46:38.1765731Z Submodule path 'third_party/mimalloc': checked out 'fbd8b99c2b828428947d70fdc046bb55609be93e' 2025-12-04T12:46:38.1860640Z Submodule path 'third_party/nlohmann': checked out '55f93686c01528224f448c19128836e7df245f72' 2025-12-04T12:46:38.3553351Z Submodule path 'third_party/onnx': checked out 'e709452ef2bbc1d113faf678c24e6d3467696e83' 2025-12-04T12:46:38.3742590Z Submodule path 'third_party/onnx/third_party/pybind11': checked out 'a2e59f0e7065404b44dfe92a28aca47ba1378dc4' 2025-12-04T12:46:38.3863533Z Submodule path 'third_party/opentelemetry-cpp': checked out 'a799f4aed9c94b765dcdaabaeab7d5e7e2310878' 2025-12-04T12:46:38.3957243Z Submodule path 'third_party/opentelemetry-cpp/third_party/benchmark': checked out 'd572f4777349d43653b21d6c2fc63020ab326db2' 2025-12-04T12:46:38.4030920Z Submodule path 'third_party/opentelemetry-cpp/third_party/googletest': checked out 'b796f7d44681514f58a683a3a71ff17c94edb0c1' 2025-12-04T12:46:38.4104249Z Submodule path 'third_party/opentelemetry-cpp/third_party/ms-gsl': checked out '6f4529395c5b7c2d661812257cd6780c67e54afa' 2025-12-04T12:46:38.4201547Z Submodule path 'third_party/opentelemetry-cpp/third_party/nlohmann-json': checked out 'bc889afb4c5bf1c0d8ee29ef35eaaf4c8bef8a5d' 2025-12-04T12:46:38.4263583Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto': checked out '4ca4f0335c63cda7ab31ea7ed70d6553aee14dce' 2025-12-04T12:46:38.4329598Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp': checked out '06b57f48ded1fa3bdd3d4346f6ef29e40e08eaf5' 2025-12-04T12:46:38.4393078Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp': checked out 'c9ffcdda9086ffd9e1283ea7a0276d831f3c8a8d' 2025-12-04T12:46:38.4468067Z 
Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'eefb26f82b233268fc98577d265352720d477ba4' 2025-12-04T12:46:38.4543297Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T12:46:38.4686182Z Submodule path 'third_party/opentelemetry-cpp/tools/vcpkg': checked out '8eb57355a4ffb410a2e94c07b4dca2dffbee8e50' 2025-12-04T12:46:38.4747681Z Submodule path 'third_party/pocketfft': checked out '0fa0ef591e38c2758e3184c6c23e497b9f732ffa' 2025-12-04T12:46:38.6068334Z Submodule path 'third_party/protobuf': checked out 'd1eca4e4b421cd2997495c4b4e65cea6be4e9b8a' 2025-12-04T12:46:38.6165429Z Submodule path 'third_party/protobuf/third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8' 2025-12-04T12:46:38.6389038Z Submodule path 'third_party/protobuf/third_party/googletest': checked out '5ec7f0c4a113e2f18ac2c6cc7df51ad6afc24081' 2025-12-04T12:46:38.6451844Z Submodule path 'third_party/psimd': checked out '072586a71b55b7f8c584153d223e95687148a900' 2025-12-04T12:46:38.6535352Z Submodule path 'third_party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8' 2025-12-04T12:46:38.6728362Z Submodule path 'third_party/pybind11': checked out 'f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8' 2025-12-04T12:46:38.6957986Z Submodule path 'third_party/python-peachpy': checked out 'f45429b087dd7d5bc78bb40dc7cf06425c252d67' 2025-12-04T12:46:38.7219578Z Submodule path 'third_party/sleef': checked out '5a1d179df9cf652951b59010a2d2075372d67f68' 2025-12-04T12:46:38.7342735Z Submodule path 'third_party/tensorpipe': checked out '2b4cd91092d335a697416b2a3cb398283246849d' 2025-12-04T12:46:38.7530097Z Submodule path 'third_party/tensorpipe/third_party/googletest': checked out 'aee0f9d9b5b87796ee8a0ab26b7587ec30e8858e' 2025-12-04T12:46:38.7625216Z Submodule path 'third_party/tensorpipe/third_party/libnop': checked 
out '910b55815be16109f04f4180e9adee14fb4ce281' 2025-12-04T12:46:38.7901273Z Submodule path 'third_party/tensorpipe/third_party/libuv': checked out '5152db2cbfeb5582e9c27c5ea1dba2cd9e10759b' 2025-12-04T12:46:38.8027854Z Submodule path 'third_party/tensorpipe/third_party/pybind11': checked out 'a23996fce38ff6ccfbcdc09f1e63f2c4be5ea2ef' 2025-12-04T12:46:38.8089612Z Submodule path 'third_party/tensorpipe/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2025-12-04T12:46:38.8128176Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0 2025-12-04T12:46:38.8349857Z Entering 'android/libs/fbjni' 2025-12-04T12:46:38.8379594Z Entering 'third_party/FP16' 2025-12-04T12:46:38.8401700Z Entering 'third_party/FXdiv' 2025-12-04T12:46:38.8425825Z Entering 'third_party/NNPACK' 2025-12-04T12:46:38.8447547Z Entering 'third_party/NVTX' 2025-12-04T12:46:38.8466209Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:38.8485933Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:38.8511307Z Entering 'third_party/aiter' 2025-12-04T12:46:38.8531783Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:38.8575258Z Entering 'third_party/benchmark' 2025-12-04T12:46:38.8594912Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:38.8626393Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:38.8646577Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:38.8675799Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:38.8702164Z Entering 'third_party/cutlass' 2025-12-04T12:46:38.8726396Z Entering 'third_party/fbgemm' 2025-12-04T12:46:38.8749581Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:38.8770372Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:38.8794625Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:38.8813978Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:38.8836892Z Entering 
'third_party/fbgemm/external/googletest' 2025-12-04T12:46:38.8856705Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:38.8881599Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:38.8905419Z Entering 'third_party/flash-attention' 2025-12-04T12:46:38.8927798Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:38.8962344Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:38.8986371Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:38.9012879Z Entering 'third_party/fmt' 2025-12-04T12:46:38.9033303Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:38.9053452Z Entering 'third_party/gloo' 2025-12-04T12:46:38.9076762Z Entering 'third_party/googletest' 2025-12-04T12:46:38.9097338Z Entering 'third_party/ideep' 2025-12-04T12:46:38.9119444Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:38.9153214Z Entering 'third_party/ittapi' 2025-12-04T12:46:38.9175504Z Entering 'third_party/kineto' 2025-12-04T12:46:38.9197576Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:38.9222511Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:38.9254625Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:38.9282193Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:38.9305626Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:38.9328653Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:38.9359193Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:38.9380857Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:38.9401520Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:38.9422563Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:38.9440578Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:38.9459722Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:38.9481067Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:38.9506548Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:38.9539975Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:38.9569041Z Entering 'third_party/kleidiai' 2025-12-04T12:46:38.9592031Z Entering 'third_party/mimalloc' 2025-12-04T12:46:38.9612385Z Entering 'third_party/nlohmann' 2025-12-04T12:46:38.9634259Z Entering 'third_party/onnx' 2025-12-04T12:46:38.9658889Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:38.9687066Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:38.9714611Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:38.9748748Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:38.9769551Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:38.9791092Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:38.9811760Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:38.9831262Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:38.9850901Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:38.9868449Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:38.9890120Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:38.9911950Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 
2025-12-04T12:46:38.9941804Z Entering 'third_party/pocketfft' 2025-12-04T12:46:38.9961471Z Entering 'third_party/protobuf' 2025-12-04T12:46:38.9981390Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:39.0014288Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:39.0036599Z Entering 'third_party/psimd' 2025-12-04T12:46:39.0056475Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:39.0075988Z Entering 'third_party/pybind11' 2025-12-04T12:46:39.0094565Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:39.0114131Z Entering 'third_party/sleef' 2025-12-04T12:46:39.0133847Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:39.0155187Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:39.0174269Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:39.0195683Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:39.0214338Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:39.0231918Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:39.0264471Z ##[endgroup] 2025-12-04T12:46:39.0264661Z ##[group]Persisting credentials for submodules 2025-12-04T12:46:39.0272266Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-12-04T12:46:39.0447312Z Entering 'android/libs/fbjni' 2025-12-04T12:46:39.0476421Z Entering 'third_party/FP16' 2025-12-04T12:46:39.0504400Z Entering 'third_party/FXdiv' 2025-12-04T12:46:39.0526327Z Entering 'third_party/NNPACK' 2025-12-04T12:46:39.0550561Z Entering 'third_party/NVTX' 2025-12-04T12:46:39.0576685Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:39.0600654Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:39.0628878Z Entering 'third_party/aiter' 2025-12-04T12:46:39.0664129Z Entering 
'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:39.0704422Z Entering 'third_party/benchmark' 2025-12-04T12:46:39.0734764Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:39.0772886Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:39.0805430Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:39.0833642Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:39.0866024Z Entering 'third_party/cutlass' 2025-12-04T12:46:39.0900659Z Entering 'third_party/fbgemm' 2025-12-04T12:46:39.0937129Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:39.0968628Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:39.0998883Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:39.1022997Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:39.1052527Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:39.1088851Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:39.1131405Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:39.1170102Z Entering 'third_party/flash-attention' 2025-12-04T12:46:39.1199528Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:39.1227004Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:39.1254089Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:39.1278819Z Entering 'third_party/fmt' 2025-12-04T12:46:39.1305544Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:39.1328523Z Entering 'third_party/gloo' 2025-12-04T12:46:39.1355113Z Entering 'third_party/googletest' 2025-12-04T12:46:39.1375031Z Entering 'third_party/ideep' 2025-12-04T12:46:39.1395346Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:39.1432533Z Entering 'third_party/ittapi' 2025-12-04T12:46:39.1455016Z Entering 'third_party/kineto' 2025-12-04T12:46:39.1481332Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:39.1513491Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:39.1537439Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:39.1562196Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:39.1590810Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:39.1617046Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:39.1642890Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:39.1669389Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:39.1691543Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:39.1714727Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:39.1745913Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:39.1771269Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:39.1794762Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:39.1825683Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:39.1848045Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:39.1869449Z Entering 'third_party/kleidiai' 2025-12-04T12:46:39.1891486Z Entering 'third_party/mimalloc' 2025-12-04T12:46:39.1914633Z Entering 'third_party/nlohmann' 2025-12-04T12:46:39.1938926Z Entering 'third_party/onnx' 2025-12-04T12:46:39.1975068Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:39.2011860Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:39.2037667Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:39.2060495Z 
Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:39.2083012Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:39.2107203Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:39.2128935Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:39.2151544Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:39.2173082Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:39.2192791Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:39.2215969Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:39.2238427Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:39.2274217Z Entering 'third_party/pocketfft' 2025-12-04T12:46:39.2297313Z Entering 'third_party/protobuf' 2025-12-04T12:46:39.2332847Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:39.2364926Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:39.2392056Z Entering 'third_party/psimd' 2025-12-04T12:46:39.2413749Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:39.2436235Z Entering 'third_party/pybind11' 2025-12-04T12:46:39.2460969Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:39.2488410Z Entering 'third_party/sleef' 2025-12-04T12:46:39.2513488Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:39.2536141Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:39.2569517Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:39.2595897Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:39.2624724Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:39.2645539Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:39.2690774Z 
[command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-12-04T12:46:39.2867281Z Entering 'android/libs/fbjni' 2025-12-04T12:46:39.2889001Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T12:46:39.2900081Z Entering 'third_party/FP16' 2025-12-04T12:46:39.2921647Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T12:46:39.2931533Z Entering 'third_party/FXdiv' 2025-12-04T12:46:39.2952233Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T12:46:39.2962502Z Entering 'third_party/NNPACK' 2025-12-04T12:46:39.2981704Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T12:46:39.2991541Z Entering 'third_party/NVTX' 2025-12-04T12:46:39.3013014Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T12:46:39.3022992Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:39.3043909Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T12:46:39.3053888Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:39.3074262Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T12:46:39.3088983Z Entering 'third_party/aiter' 2025-12-04T12:46:39.3111380Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T12:46:39.3129596Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:39.3166037Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 
2025-12-04T12:46:39.3179571Z Entering 'third_party/benchmark' 2025-12-04T12:46:39.3205361Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:39.3215917Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:39.3235035Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T12:46:39.3246843Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:39.3268650Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T12:46:39.3278244Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:39.3299963Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T12:46:39.3309168Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:39.3329581Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T12:46:39.3338789Z Entering 'third_party/cutlass' 2025-12-04T12:46:39.3365784Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T12:46:39.3378759Z Entering 'third_party/fbgemm' 2025-12-04T12:46:39.3397261Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T12:46:39.3412990Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:39.3446282Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T12:46:39.3455772Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:39.3477713Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T12:46:39.3499371Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:39.3522856Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T12:46:39.3533316Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:39.3555142Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T12:46:39.3572034Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:39.3594184Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T12:46:39.3603691Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:39.3627809Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T12:46:39.3637138Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:39.3658331Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T12:46:39.3670558Z Entering 'third_party/flash-attention' 2025-12-04T12:46:39.3690386Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T12:46:39.3705637Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:39.3729407Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T12:46:39.3743708Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:39.3766979Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T12:46:39.3784119Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:39.3807635Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T12:46:39.3819512Z Entering 'third_party/fmt' 
2025-12-04T12:46:39.3840889Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T12:46:39.3850499Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:39.3869804Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T12:46:39.3879800Z Entering 'third_party/gloo' 2025-12-04T12:46:39.3899417Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T12:46:39.3908978Z Entering 'third_party/googletest' 2025-12-04T12:46:39.3932457Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:39.3942110Z Entering 'third_party/ideep' 2025-12-04T12:46:39.3961055Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T12:46:39.3970770Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:39.3991802Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T12:46:39.4010311Z Entering 'third_party/ittapi' 2025-12-04T12:46:39.4037621Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T12:46:39.4047839Z Entering 'third_party/kineto' 2025-12-04T12:46:39.4081913Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T12:46:39.4091849Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:39.4115105Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T12:46:39.4124782Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:39.4144620Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config 
remote.origin.url 2025-12-04T12:46:39.4158460Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:39.4178862Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T12:46:39.4188518Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:39.4211242Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T12:46:39.4221083Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:39.4239999Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T12:46:39.4249151Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:39.4270245Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T12:46:39.4282082Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:39.4310892Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T12:46:39.4320881Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:39.4339949Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:39.4350891Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:39.4373720Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T12:46:39.4383979Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:39.4404801Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T12:46:39.4414982Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:39.4436667Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T12:46:39.4446684Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:39.4468230Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T12:46:39.4478845Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:39.4498182Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T12:46:39.4517201Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:39.4540569Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T12:46:39.4550313Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:39.4570620Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 
2025-12-04T12:46:39.4585294Z Entering 'third_party/kleidiai' 2025-12-04T12:46:39.4605371Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T12:46:39.4614724Z Entering 'third_party/mimalloc' 2025-12-04T12:46:39.4634514Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T12:46:39.4644023Z Entering 'third_party/nlohmann' 2025-12-04T12:46:39.4665129Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T12:46:39.4675663Z Entering 'third_party/onnx' 2025-12-04T12:46:39.4695201Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T12:46:39.4709840Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:39.4730057Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:39.4743391Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:39.4769603Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T12:46:39.4780777Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:39.4816039Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:39.4825201Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:39.4850199Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:39.4859873Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:39.4879039Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T12:46:39.4889228Z Entering 
'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:39.4909191Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T12:46:39.4922229Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:39.4942769Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T12:46:39.4959747Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:39.4981143Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T12:46:39.4996514Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:39.5020758Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T12:46:39.5030527Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:39.5067127Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T12:46:39.5080573Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:39.5105069Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T12:46:39.5117632Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:39.5138963Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T12:46:39.5157672Z Entering 'third_party/pocketfft' 2025-12-04T12:46:39.5179794Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T12:46:39.5189565Z Entering 'third_party/protobuf' 2025-12-04T12:46:39.5209108Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T12:46:39.5220769Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:39.5243458Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:39.5253576Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:39.5275107Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:39.5288538Z Entering 'third_party/psimd' 2025-12-04T12:46:39.5307991Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T12:46:39.5316436Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:39.5343679Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T12:46:39.5362253Z Entering 'third_party/pybind11' 2025-12-04T12:46:39.5395494Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:39.5407250Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:39.5433432Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T12:46:39.5450939Z Entering 'third_party/sleef' 2025-12-04T12:46:39.5479944Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T12:46:39.5500192Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:39.5522688Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T12:46:39.5536422Z Entering 
'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:39.5567181Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:39.5576083Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:39.5599487Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T12:46:39.5609466Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:39.5642081Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T12:46:39.5656350Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:39.5677679Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:39.5687580Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:39.5706069Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T12:46:39.5885714Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-12-04T12:46:39.6089797Z Entering 'android/libs/fbjni' 2025-12-04T12:46:39.6109979Z Entering 'third_party/FP16' 2025-12-04T12:46:39.6143420Z Entering 'third_party/FXdiv' 2025-12-04T12:46:39.6170136Z Entering 'third_party/NNPACK' 2025-12-04T12:46:39.6192521Z Entering 'third_party/NVTX' 2025-12-04T12:46:39.6218730Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:39.6249444Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:39.6282312Z Entering 'third_party/aiter' 2025-12-04T12:46:39.6313741Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:39.6354480Z Entering 'third_party/benchmark' 
2025-12-04T12:46:39.6380363Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:39.6410446Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:39.6432774Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:39.6458336Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:39.6485908Z Entering 'third_party/cutlass' 2025-12-04T12:46:39.6512115Z Entering 'third_party/fbgemm' 2025-12-04T12:46:39.6542353Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:39.6569870Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:39.6600127Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:39.6623173Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:39.6649214Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:39.6670044Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:39.6698659Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:39.6727943Z Entering 'third_party/flash-attention' 2025-12-04T12:46:39.6749554Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:39.6769966Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:39.6801809Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:39.6828147Z Entering 'third_party/fmt' 2025-12-04T12:46:39.6849588Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:39.6876892Z Entering 'third_party/gloo' 2025-12-04T12:46:39.6901459Z Entering 'third_party/googletest' 2025-12-04T12:46:39.6924057Z Entering 'third_party/ideep' 2025-12-04T12:46:39.6942278Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:39.6974866Z Entering 'third_party/ittapi' 2025-12-04T12:46:39.6998810Z Entering 'third_party/kineto' 2025-12-04T12:46:39.7022335Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:39.7046833Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:39.7068273Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:39.7088472Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:39.7109768Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:39.7133392Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:39.7158634Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:39.7182653Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:39.7204769Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:39.7226576Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:39.7248533Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:39.7270147Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:39.7301430Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:39.7327688Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:39.7351549Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:39.7374615Z Entering 'third_party/kleidiai' 2025-12-04T12:46:39.7397888Z Entering 'third_party/mimalloc' 2025-12-04T12:46:39.7421032Z Entering 'third_party/nlohmann' 2025-12-04T12:46:39.7443171Z Entering 'third_party/onnx' 2025-12-04T12:46:39.7473573Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:39.7500390Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:39.7525059Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:39.7551140Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:39.7578500Z Entering 
'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:39.7598979Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:39.7619377Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:39.7640258Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:39.7659171Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:39.7678358Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:39.7701784Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:39.7731617Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:39.7758375Z Entering 'third_party/pocketfft' 2025-12-04T12:46:39.7782286Z Entering 'third_party/protobuf' 2025-12-04T12:46:39.7805320Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:39.7827795Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:39.7851727Z Entering 'third_party/psimd' 2025-12-04T12:46:39.7880542Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:39.7910728Z Entering 'third_party/pybind11' 2025-12-04T12:46:39.7936604Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:39.7959794Z Entering 'third_party/sleef' 2025-12-04T12:46:39.7981043Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:39.8006110Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:39.8030308Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:39.8053481Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:39.8078733Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:39.8105647Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:39.8146010Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 
'org-21003710@github.com:' 2025-12-04T12:46:39.8323386Z Entering 'android/libs/fbjni' 2025-12-04T12:46:39.8345052Z Entering 'third_party/FP16' 2025-12-04T12:46:39.8373179Z Entering 'third_party/FXdiv' 2025-12-04T12:46:39.8396193Z Entering 'third_party/NNPACK' 2025-12-04T12:46:39.8418102Z Entering 'third_party/NVTX' 2025-12-04T12:46:39.8439724Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:39.8459864Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:39.8487315Z Entering 'third_party/aiter' 2025-12-04T12:46:39.8508118Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:39.8537419Z Entering 'third_party/benchmark' 2025-12-04T12:46:39.8564155Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:39.8589777Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:39.8610625Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:39.8631381Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:39.8650403Z Entering 'third_party/cutlass' 2025-12-04T12:46:39.8671815Z Entering 'third_party/fbgemm' 2025-12-04T12:46:39.8698436Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:39.8718789Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:39.8739856Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:39.8761884Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:39.8799474Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:39.8827029Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:39.8848632Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:39.8873898Z Entering 'third_party/flash-attention' 2025-12-04T12:46:39.8903777Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:39.8927873Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:39.8953772Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:39.8980585Z Entering 'third_party/fmt' 2025-12-04T12:46:39.9004661Z Entering 
'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:39.9028978Z Entering 'third_party/gloo' 2025-12-04T12:46:39.9047765Z Entering 'third_party/googletest' 2025-12-04T12:46:39.9066135Z Entering 'third_party/ideep' 2025-12-04T12:46:39.9086449Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:39.9108657Z Entering 'third_party/ittapi' 2025-12-04T12:46:39.9127223Z Entering 'third_party/kineto' 2025-12-04T12:46:39.9150606Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:39.9181573Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:39.9201708Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:39.9222927Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:39.9241926Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:39.9259259Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:39.9287646Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:39.9307980Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:39.9328346Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:39.9351156Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:39.9370836Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:39.9390170Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:39.9418893Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:39.9445000Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:39.9467463Z Entering 
'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:39.9488406Z Entering 'third_party/kleidiai' 2025-12-04T12:46:39.9507693Z Entering 'third_party/mimalloc' 2025-12-04T12:46:39.9535654Z Entering 'third_party/nlohmann' 2025-12-04T12:46:39.9556431Z Entering 'third_party/onnx' 2025-12-04T12:46:39.9581826Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:39.9607228Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:39.9627123Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:39.9654345Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:39.9673378Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:39.9693290Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:39.9720082Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:39.9738771Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:39.9756779Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:39.9775046Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:39.9796162Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:39.9816266Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:39.9843558Z Entering 'third_party/pocketfft' 2025-12-04T12:46:39.9864202Z Entering 'third_party/protobuf' 2025-12-04T12:46:39.9895740Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:39.9915940Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:39.9943015Z Entering 'third_party/psimd' 2025-12-04T12:46:39.9963310Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:39.9988659Z Entering 'third_party/pybind11' 2025-12-04T12:46:40.0009680Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:40.0031843Z 
Entering 'third_party/sleef' 2025-12-04T12:46:40.0051167Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:40.0074493Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:40.0103626Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:40.0125127Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:40.0145200Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:40.0163413Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:40.0196970Z ##[endgroup] 2025-12-04T12:46:40.0392147Z [command]/usr/bin/git log -1 --format=%H 2025-12-04T12:46:40.0617071Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:46:40.0755930Z ##[group]Run actions/checkout@v4 2025-12-04T12:46:40.0756071Z with: 2025-12-04T12:46:40.0756186Z ref: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:46:40.0756324Z fetch-depth: 0 2025-12-04T12:46:40.0756437Z submodules: recursive 2025-12-04T12:46:40.0756549Z show-progress: false 2025-12-04T12:46:40.0756669Z repository: pytorch/pytorch 2025-12-04T12:46:40.0756821Z token: *** 2025-12-04T12:46:40.0756919Z ssh-strict: true 2025-12-04T12:46:40.0757014Z ssh-user: git 2025-12-04T12:46:40.0757118Z persist-credentials: true 2025-12-04T12:46:40.0757241Z clean: true 2025-12-04T12:46:40.0757341Z sparse-checkout-cone-mode: true 2025-12-04T12:46:40.0757462Z fetch-tags: false 2025-12-04T12:46:40.0757608Z lfs: false 2025-12-04T12:46:40.0757702Z set-safe-directory: true 2025-12-04T12:46:40.0757807Z env: 2025-12-04T12:46:40.0757896Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:40.0758005Z ##[endgroup] 2025-12-04T12:46:40.1227749Z Syncing repository: pytorch/pytorch 2025-12-04T12:46:40.1227999Z ##[group]Getting Git version info 2025-12-04T12:46:40.1228165Z Working directory is '/home/runner/_work/pytorch/pytorch' 2025-12-04T12:46:40.1243365Z [command]/usr/bin/git version 2025-12-04T12:46:40.1268589Z git version 2.52.0 2025-12-04T12:46:40.1282037Z ##[endgroup] 
2025-12-04T12:46:40.1287131Z Copying '/home/runner/.gitconfig' to '/home/runner/_work/_temp/ce66c873-cf0c-4683-b5a0-12bd4f08d100/.gitconfig' 2025-12-04T12:46:40.1292296Z Temporarily overriding HOME='/home/runner/_work/_temp/ce66c873-cf0c-4683-b5a0-12bd4f08d100' before making global git config changes 2025-12-04T12:46:40.1292640Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T12:46:40.1300367Z [command]/usr/bin/git config --global --add safe.directory /home/runner/_work/pytorch/pytorch 2025-12-04T12:46:40.1330522Z [command]/usr/bin/git config --local --get remote.origin.url 2025-12-04T12:46:40.1352757Z https://github.com/pytorch/pytorch 2025-12-04T12:46:40.1367200Z ##[group]Removing previously created refs, to avoid conflicts 2025-12-04T12:46:40.1369989Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD 2025-12-04T12:46:40.1385717Z HEAD 2025-12-04T12:46:40.1416003Z ##[endgroup] 2025-12-04T12:46:40.1418553Z [command]/usr/bin/git submodule status 2025-12-04T12:46:40.1634420Z 7e1e1fe3858c63c251c637ae41a20de425dde96f android/libs/fbjni (v0.1.0-12-g7e1e1fe) 2025-12-04T12:46:40.1683324Z 4dfe081cf6bcd15db339cf2680b9281b8451eeb3 third_party/FP16 (4dfe081) 2025-12-04T12:46:40.1732571Z b408327ac2a15ec3e43352421954f5b1967701d1 third_party/FXdiv (b408327) 2025-12-04T12:46:40.1789877Z c07e3a0400713d546e0dea2d5466dd22ea389c73 third_party/NNPACK (c07e3a0) 2025-12-04T12:46:40.1829404Z 3ebbc93ded7285963bff932c678fa367eb393ba6 third_party/NVTX (v3.1.0-313-g3ebbc93) 2025-12-04T12:46:40.1879583Z 1d8f600fd424278486eade7ed3e877c99f0846b1 third_party/VulkanMemoryAllocator (v2.1.0-982-g1d8f600) 2025-12-04T12:46:40.2180624Z 51a0103656eff6fc9bfd39a4597923c4b542c883 third_party/XNNPACK (remotes/origin/ds/ndk-1243-g51a0103656) 2025-12-04T12:46:40.2209483Z 01aae101b9e5e94d6c16a9514c9fb8df99c93150 third_party/aiter (v0.1.1-92-g01aae101) 2025-12-04T12:46:40.2229584Z 299e5928955cc62af9968370293b916f5130916f 
third_party/benchmark (v1.9.3) 2025-12-04T12:46:40.2299939Z 7fe50dc3da2069d6645d9deb8c017a876472a977 third_party/composable_kernel (rocm-6.4.3-459-g7fe50dc3d) 2025-12-04T12:46:40.2399284Z 89c932f313c6437c38f2982869beacc89c2f2246 third_party/cpp-httplib (v0.26.0) 2025-12-04T12:46:40.2505531Z f858c30bcb16f8effd5ff46996f0514539e17abc third_party/cpuinfo (f858c30) 2025-12-04T12:46:40.2537235Z 0b1577c8c83401237d601d0d0db5210506705396 third_party/cudnn_frontend (v0.5-61-g0b1577c) 2025-12-04T12:46:40.2599707Z f88806b1e31dfa579842638740216dd41fc6c588 third_party/cutlass (v4.3.1) 2025-12-04T12:46:40.2630758Z c0b988d39a9e47c794d699f29930ed4d7c7e13a4 third_party/fbgemm (v1.4.0-rc1-2-gc0b988d39) 2025-12-04T12:46:40.2689480Z 979702c87a8713a8e0a5e9fee122b90d2ef13be5 third_party/flash-attention (v2.7.4) 2025-12-04T12:46:40.2703906Z a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757 third_party/flatbuffers (v24.12.23) 2025-12-04T12:46:40.2940784Z 407c905e45ad75fc29bf0f9bb7c5c2fd3475976f third_party/fmt (12.1.0) 2025-12-04T12:46:40.3017970Z 3fb5c176c17c765a3492cd2f0321b0dab712f350 third_party/gemmlowp/gemmlowp (remotes/origin/revert-87-master-135-g3fb5c17) 2025-12-04T12:46:40.3091478Z 54cbae0d3a67fa890b4c3d9ee162b7860315e341 third_party/gloo (remotes/origin/gh/c-p-i-o/1/base-37-g54cbae0) 2025-12-04T12:46:40.3228576Z 52eb8108c5bdec04579160ae17225d66034bd723 third_party/googletest (release-1.8.0-3544-g52eb8108) 2025-12-04T12:46:40.3277149Z 719d8e6cd7f7a0e01b155657526d693acf97c2b3 third_party/ideep (pytorch-rls-v3.7.1) 2025-12-04T12:46:40.3316096Z dec1d23ca65ab069d225dfe40dea14f455170959 third_party/ittapi (v3.25.5) 2025-12-04T12:46:40.3455076Z 31f85df8fbd89c188f14ef10f1ec65379786b943 third_party/kineto (heads/main) 2025-12-04T12:46:40.3472868Z d7770c89632329a9914ef1a90289917597639cbe third_party/kleidiai (v1.15.0) 2025-12-04T12:46:40.3498384Z fbd8b99c2b828428947d70fdc046bb55609be93e third_party/mimalloc (v2.2.4) 2025-12-04T12:46:40.3511691Z 55f93686c01528224f448c19128836e7df245f72 
third_party/nlohmann (v3.12.0) 2025-12-04T12:46:40.3704200Z e709452ef2bbc1d113faf678c24e6d3467696e83 third_party/onnx (v1.18.0) 2025-12-04T12:46:40.3730192Z a799f4aed9c94b765dcdaabaeab7d5e7e2310878 third_party/opentelemetry-cpp (v1.14.2) 2025-12-04T12:46:40.3746858Z 0fa0ef591e38c2758e3184c6c23e497b9f732ffa third_party/pocketfft (release_for_eigen-40-g0fa0ef5) 2025-12-04T12:46:40.3963203Z d1eca4e4b421cd2997495c4b4e65cea6be4e9b8a third_party/protobuf (v3.7.0-rc.2-1279-gd1eca4e4b) 2025-12-04T12:46:40.4014564Z 072586a71b55b7f8c584153d223e95687148a900 third_party/psimd (heads/master) 2025-12-04T12:46:40.4058077Z 4fe0e1e183925bf8cfa6aae24237e724a96479b8 third_party/pthreadpool (0.1-144-g4fe0e1e) 2025-12-04T12:46:40.4075364Z f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8 third_party/pybind11 (v3.0.1) 2025-12-04T12:46:40.4131792Z f45429b087dd7d5bc78bb40dc7cf06425c252d67 third_party/python-peachpy (remotes/origin/pre-generated) 2025-12-04T12:46:40.4179656Z 5a1d179df9cf652951b59010a2d2075372d67f68 third_party/sleef (3.8) 2025-12-04T12:46:40.4227251Z 2b4cd91092d335a697416b2a3cb398283246849d third_party/tensorpipe (heads/main) 2025-12-04T12:46:40.4236122Z ##[group]Cleaning the repository 2025-12-04T12:46:40.4239742Z [command]/usr/bin/git clean -ffdx 2025-12-04T12:46:40.4359874Z [command]/usr/bin/git reset --hard HEAD 2025-12-04T12:46:40.5162720Z HEAD is now at ffd9b0fb4355 Resolve collective autotuning test failure on arm (#168919) 2025-12-04T12:46:40.5236523Z ##[endgroup] 2025-12-04T12:46:40.5239060Z ##[group]Disabling automatic garbage collection 2025-12-04T12:46:40.5243229Z [command]/usr/bin/git config --local gc.auto 0 2025-12-04T12:46:40.5268924Z ##[endgroup] 2025-12-04T12:46:40.5269166Z ##[group]Setting up auth 2025-12-04T12:46:40.5272655Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T12:46:40.5296953Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git 
config --local --unset-all 'core.sshCommand' || :" 2025-12-04T12:46:40.5515822Z Entering 'android/libs/fbjni' 2025-12-04T12:46:40.5544841Z Entering 'third_party/FP16' 2025-12-04T12:46:40.5578166Z Entering 'third_party/FXdiv' 2025-12-04T12:46:40.5609576Z Entering 'third_party/NNPACK' 2025-12-04T12:46:40.5641005Z Entering 'third_party/NVTX' 2025-12-04T12:46:40.5674429Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:40.5704852Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:40.5738651Z Entering 'third_party/aiter' 2025-12-04T12:46:40.5764800Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:40.5793701Z Entering 'third_party/benchmark' 2025-12-04T12:46:40.5817745Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:40.5846204Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:40.5873830Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:40.5895383Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:40.5920372Z Entering 'third_party/cutlass' 2025-12-04T12:46:40.5951523Z Entering 'third_party/fbgemm' 2025-12-04T12:46:40.5976443Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:40.6002806Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:40.6025619Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:40.6045938Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:40.6068731Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:40.6097704Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:40.6125626Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:40.6148798Z Entering 'third_party/flash-attention' 2025-12-04T12:46:40.6175605Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:40.6199037Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:40.6225920Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:40.6265021Z Entering 'third_party/fmt' 2025-12-04T12:46:40.6292615Z 
Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:40.6314179Z Entering 'third_party/gloo' 2025-12-04T12:46:40.6342167Z Entering 'third_party/googletest' 2025-12-04T12:46:40.6371640Z Entering 'third_party/ideep' 2025-12-04T12:46:40.6398069Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:40.6440501Z Entering 'third_party/ittapi' 2025-12-04T12:46:40.6471979Z Entering 'third_party/kineto' 2025-12-04T12:46:40.6497524Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:40.6520718Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:40.6546239Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:40.6570908Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:40.6596105Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:40.6620240Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:40.6647383Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:40.6679426Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:40.6705566Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:40.6729519Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:40.6756315Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:40.6782404Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:40.6804262Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:40.6830494Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:40.6876534Z Entering 
'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:40.6906952Z Entering 'third_party/kleidiai' 2025-12-04T12:46:40.6942166Z Entering 'third_party/mimalloc' 2025-12-04T12:46:40.6966915Z Entering 'third_party/nlohmann' 2025-12-04T12:46:40.6991095Z Entering 'third_party/onnx' 2025-12-04T12:46:40.7019806Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:40.7055427Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:40.7081736Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:40.7119181Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:40.7148315Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:40.7169128Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:40.7191778Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:40.7216519Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:40.7237886Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:40.7260428Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:40.7282971Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:40.7307004Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:40.7340263Z Entering 'third_party/pocketfft' 2025-12-04T12:46:40.7363314Z Entering 'third_party/protobuf' 2025-12-04T12:46:40.7388503Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:40.7425997Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:40.7451670Z Entering 'third_party/psimd' 2025-12-04T12:46:40.7479990Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:40.7502710Z Entering 'third_party/pybind11' 2025-12-04T12:46:40.7527132Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:40.7552386Z 
Entering 'third_party/sleef' 2025-12-04T12:46:40.7575830Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:40.7600075Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:40.7621009Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:40.7642757Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:40.7666510Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:40.7687144Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:40.7745377Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T12:46:40.7761637Z http.https://github.com/.extraheader 2025-12-04T12:46:40.7768664Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-12-04T12:46:40.7789059Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-12-04T12:46:40.7972776Z Entering 'android/libs/fbjni' 2025-12-04T12:46:40.7999188Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8023557Z Entering 'third_party/FP16' 2025-12-04T12:46:40.8040520Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8062948Z Entering 'third_party/FXdiv' 2025-12-04T12:46:40.8082479Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8100342Z Entering 'third_party/NNPACK' 2025-12-04T12:46:40.8113173Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8131126Z Entering 'third_party/NVTX' 2025-12-04T12:46:40.8146610Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8164030Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:40.8177930Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8202911Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:40.8216790Z http.https://github.com/.extraheader 
2025-12-04T12:46:40.8243866Z Entering 'third_party/aiter' 2025-12-04T12:46:40.8258641Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8280390Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:40.8300019Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8326782Z Entering 'third_party/benchmark' 2025-12-04T12:46:40.8340863Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8358465Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:40.8371941Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8403132Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:40.8416533Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8433317Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:40.8449581Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8467929Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:40.8486060Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8503413Z Entering 'third_party/cutlass' 2025-12-04T12:46:40.8518035Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8545482Z Entering 'third_party/fbgemm' 2025-12-04T12:46:40.8558933Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8579076Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:40.8592577Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8610158Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:40.8625103Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8656281Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:40.8672497Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8693286Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:40.8707877Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8730675Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:40.8747910Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8764602Z Entering 'third_party/fbgemm/external/hipify_torch' 
2025-12-04T12:46:40.8776957Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8792448Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:40.8804821Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8822893Z Entering 'third_party/flash-attention' 2025-12-04T12:46:40.8836239Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8852687Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:40.8879198Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8900962Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:40.8917391Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8944955Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:40.8958255Z http.https://github.com/.extraheader 2025-12-04T12:46:40.8980360Z Entering 'third_party/fmt' 2025-12-04T12:46:40.8993732Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9009172Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:40.9021314Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9040032Z Entering 'third_party/gloo' 2025-12-04T12:46:40.9055595Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9079238Z Entering 'third_party/googletest' 2025-12-04T12:46:40.9094055Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9112741Z Entering 'third_party/ideep' 2025-12-04T12:46:40.9133120Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9154418Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:40.9176035Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9211539Z Entering 'third_party/ittapi' 2025-12-04T12:46:40.9224585Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9243765Z Entering 'third_party/kineto' 2025-12-04T12:46:40.9255708Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9273507Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:40.9286001Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9307751Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:40.9322400Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9342212Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:40.9357635Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9374116Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:40.9387147Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9407472Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:40.9419142Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9434651Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:40.9452842Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9470786Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:40.9498517Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9518119Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:40.9532956Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9550671Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:40.9564601Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9582051Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:40.9596084Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9619034Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:40.9632683Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9658281Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:40.9680672Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9700184Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:40.9713814Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9743049Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:40.9760364Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9777461Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:40.9790529Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9811625Z Entering 'third_party/kleidiai' 2025-12-04T12:46:40.9825136Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9844938Z Entering 'third_party/mimalloc' 2025-12-04T12:46:40.9859584Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9877127Z Entering 'third_party/nlohmann' 2025-12-04T12:46:40.9890037Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9914934Z Entering 'third_party/onnx' 2025-12-04T12:46:40.9931990Z http.https://github.com/.extraheader 2025-12-04T12:46:40.9955058Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:40.9981369Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0007435Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:41.0020913Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0037166Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:41.0057994Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0077276Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:41.0091675Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0114222Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:41.0125223Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0141384Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:41.0155274Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0176957Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 
2025-12-04T12:46:41.0196681Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0214599Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:41.0231693Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0253287Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:41.0267825Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0283768Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:41.0299557Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0325715Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:41.0340615Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0361626Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:41.0373477Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0404416Z Entering 'third_party/pocketfft' 2025-12-04T12:46:41.0417307Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0433257Z Entering 'third_party/protobuf' 2025-12-04T12:46:41.0446072Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0472484Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:41.0490465Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0509726Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:41.0529270Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0548159Z Entering 'third_party/psimd' 2025-12-04T12:46:41.0561546Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0581194Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:41.0594942Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0612938Z Entering 'third_party/pybind11' 2025-12-04T12:46:41.0630494Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0659630Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:41.0674378Z http.https://github.com/.extraheader 
2025-12-04T12:46:41.0698887Z Entering 'third_party/sleef' 2025-12-04T12:46:41.0713047Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0745448Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:41.0760265Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0782431Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:41.0796851Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0820809Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:41.0834588Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0855641Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:41.0872170Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0901386Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:41.0918069Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0938378Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:41.0953317Z http.https://github.com/.extraheader 2025-12-04T12:46:41.0999428Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.1022349Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T12:46:41.1179771Z Entering 'android/libs/fbjni' 2025-12-04T12:46:41.1192367Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T12:46:41.1200528Z Entering 'third_party/FP16' 2025-12-04T12:46:41.1210607Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T12:46:41.1219139Z Entering 'third_party/FXdiv' 2025-12-04T12:46:41.1228997Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T12:46:41.1237205Z Entering 'third_party/NNPACK' 2025-12-04T12:46:41.1246885Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T12:46:41.1255790Z Entering 'third_party/NVTX' 2025-12-04T12:46:41.1264991Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T12:46:41.1273480Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:41.1282337Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T12:46:41.1291248Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:41.1301205Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T12:46:41.1316376Z Entering 'third_party/aiter' 2025-12-04T12:46:41.1325461Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T12:46:41.1337062Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:41.1347739Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T12:46:41.1362803Z Entering 'third_party/benchmark' 2025-12-04T12:46:41.1372196Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:41.1380882Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:41.1389712Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T12:46:41.1402034Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:41.1411939Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T12:46:41.1421559Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:41.1431560Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T12:46:41.1440677Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:41.1450489Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T12:46:41.1458594Z Entering 'third_party/cutlass' 2025-12-04T12:46:41.1467562Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T12:46:41.1479620Z Entering 'third_party/fbgemm' 2025-12-04T12:46:41.1489063Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T12:46:41.1497460Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:41.1508516Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T12:46:41.1516257Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:41.1526180Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T12:46:41.1538451Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:41.1552748Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T12:46:41.1557979Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:41.1568022Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T12:46:41.1579267Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:41.1590030Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T12:46:41.1597942Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:41.1608897Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T12:46:41.1617853Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:41.1628206Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T12:46:41.1641052Z Entering 'third_party/flash-attention' 2025-12-04T12:46:41.1652920Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T12:46:41.1663031Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:41.1672084Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T12:46:41.1683595Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:41.1695395Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T12:46:41.1708312Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:41.1717964Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T12:46:41.1727390Z Entering 'third_party/fmt' 2025-12-04T12:46:41.1737627Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T12:46:41.1746646Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:41.1756698Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T12:46:41.1764659Z Entering 'third_party/gloo' 2025-12-04T12:46:41.1776386Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T12:46:41.1786081Z Entering 'third_party/googletest' 2025-12-04T12:46:41.1795503Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:41.1803908Z Entering 'third_party/ideep' 2025-12-04T12:46:41.1815419Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T12:46:41.1823791Z Entering 
'third_party/ideep/mkl-dnn' 2025-12-04T12:46:41.1836064Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T12:46:41.1848970Z Entering 'third_party/ittapi' 2025-12-04T12:46:41.1859644Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T12:46:41.1871604Z Entering 'third_party/kineto' 2025-12-04T12:46:41.1881470Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T12:46:41.1890964Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:41.1899936Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T12:46:41.1909729Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:41.1924405Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T12:46:41.1934139Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:41.1944157Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T12:46:41.1954810Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:41.1965350Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T12:46:41.1974262Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:41.1986581Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 
2025-12-04T12:46:41.1996918Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:41.2012237Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T12:46:41.2023261Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:41.2032512Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T12:46:41.2040557Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:41.2051642Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:41.2062829Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:41.2071911Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T12:46:41.2080829Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:41.2091588Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T12:46:41.2102658Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:41.2111975Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T12:46:41.2120872Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:41.2129923Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T12:46:41.2141137Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:41.2151670Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T12:46:41.2163168Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:41.2172102Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T12:46:41.2179948Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:41.2188440Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T12:46:41.2198497Z Entering 'third_party/kleidiai' 2025-12-04T12:46:41.2207406Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T12:46:41.2216728Z Entering 'third_party/mimalloc' 2025-12-04T12:46:41.2225777Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T12:46:41.2235295Z Entering 'third_party/nlohmann' 2025-12-04T12:46:41.2244519Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T12:46:41.2253585Z Entering 'third_party/onnx' 2025-12-04T12:46:41.2262876Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T12:46:41.2280899Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:41.2294363Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:41.2306448Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:41.2316107Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T12:46:41.2325770Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:41.2334279Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:41.2342305Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:41.2350741Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:41.2362473Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:41.2374383Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T12:46:41.2382781Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:41.2393208Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T12:46:41.2402471Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:41.2414296Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T12:46:41.2422651Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:41.2432364Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T12:46:41.2444356Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:41.2454010Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T12:46:41.2461570Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:41.2470946Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T12:46:41.2484514Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:41.2494627Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T12:46:41.2504517Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:41.2513156Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T12:46:41.2529791Z Entering 'third_party/pocketfft' 2025-12-04T12:46:41.2542633Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T12:46:41.2551544Z Entering 'third_party/protobuf' 2025-12-04T12:46:41.2561606Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T12:46:41.2571358Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:41.2580699Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:41.2591059Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:41.2605650Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:41.2618335Z Entering 
'third_party/psimd' 2025-12-04T12:46:41.2634092Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T12:46:41.2644070Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:41.2654517Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T12:46:41.2665723Z Entering 'third_party/pybind11' 2025-12-04T12:46:41.2675215Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:41.2684235Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:41.2693545Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T12:46:41.2703762Z Entering 'third_party/sleef' 2025-12-04T12:46:41.2713668Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T12:46:41.2722543Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:41.2732850Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T12:46:41.2741953Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:41.2764173Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:41.2773226Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:41.2785207Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T12:46:41.2797796Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:41.2809373Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T12:46:41.2819353Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:41.2830057Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:41.2838486Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:41.2847183Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T12:46:41.2875188Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.2895837Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.2915266Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.2932518Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.2947186Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.2962428Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.2976499Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.2994451Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3010919Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3027874Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3048418Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3066951Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3089104Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3107577Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3129941Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3146639Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3164465Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3180275Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3196255Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3211929Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3229098Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3244663Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3259680Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3273755Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3288860Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3310153Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3330379Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3345598Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T12:46:41.3362999Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3378858Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3397345Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3412847Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3427925Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3444592Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3464161Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3487624Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3506771Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3524642Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only 
--get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3543629Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3561411Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3578823Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3598069Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3618819Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3637081Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3654932Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3670875Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3692400Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3709558Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3725795Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3743513Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3764175Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3780947Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3801540Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3821451Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3838433Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T12:46:41.3855844Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3872186Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3896871Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3917163Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3938272Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3954865Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3976146Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.3993708Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4012481Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4032396Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4049516Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4066481Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4083769Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4101320Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4117725Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4134632Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4154504Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4173306Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4190453Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4207054Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4223692Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4241302Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4259769Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4278051Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4294743Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4311537Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T12:46:41.4331586Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T12:46:41.4357228Z ##[endgroup] 2025-12-04T12:46:41.4357564Z ##[group]Fetching the repository 2025-12-04T12:46:41.4361307Z [command]/usr/bin/git -c protocol.version=2 fetch --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* 
+refs/tags/*:refs/tags/* 2025-12-04T12:46:42.8553515Z [command]/usr/bin/git rev-parse --verify --quiet ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32^{object} 2025-12-04T12:46:42.8826375Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:46:42.8832492Z ##[endgroup] 2025-12-04T12:46:42.8832942Z ##[group]Determining the checkout info 2025-12-04T12:46:42.8835122Z ##[endgroup] 2025-12-04T12:46:42.8841557Z [command]/usr/bin/git sparse-checkout disable 2025-12-04T12:46:42.8947041Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2025-12-04T12:46:42.8972670Z ##[group]Checking out the ref 2025-12-04T12:46:42.8974585Z [command]/usr/bin/git checkout --progress --force ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:46:42.9251400Z HEAD is now at ffd9b0fb4355 Resolve collective autotuning test failure on arm (#168919) 2025-12-04T12:46:42.9255305Z ##[endgroup] 2025-12-04T12:46:42.9255766Z ##[group]Setting up auth for fetching submodules 2025-12-04T12:46:42.9260101Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T12:46:42.9292899Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2025-12-04T12:46:42.9315573Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-12-04T12:46:42.9341263Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-12-04T12:46:42.9364155Z ##[endgroup] 2025-12-04T12:46:42.9364439Z ##[group]Fetching submodules 2025-12-04T12:46:42.9364855Z [command]/usr/bin/git submodule sync --recursive 2025-12-04T12:46:42.9576453Z Synchronizing submodule url for 'android/libs/fbjni' 2025-12-04T12:46:42.9587853Z Synchronizing submodule url for 'third_party/FP16' 2025-12-04T12:46:42.9598613Z Synchronizing submodule url for 'third_party/FXdiv' 2025-12-04T12:46:42.9610290Z Synchronizing submodule url for 'third_party/NNPACK' 2025-12-04T12:46:42.9625335Z 
Synchronizing submodule url for 'third_party/NVTX' 2025-12-04T12:46:42.9636222Z Synchronizing submodule url for 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:42.9645803Z Synchronizing submodule url for 'third_party/XNNPACK' 2025-12-04T12:46:42.9663991Z Synchronizing submodule url for 'third_party/aiter' 2025-12-04T12:46:42.9675709Z Synchronizing submodule url for 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:42.9689248Z Synchronizing submodule url for 'third_party/benchmark' 2025-12-04T12:46:42.9700183Z Synchronizing submodule url for 'third_party/composable_kernel' 2025-12-04T12:46:42.9714587Z Synchronizing submodule url for 'third_party/cpp-httplib' 2025-12-04T12:46:42.9725003Z Synchronizing submodule url for 'third_party/cpuinfo' 2025-12-04T12:46:42.9736317Z Synchronizing submodule url for 'third_party/cudnn_frontend' 2025-12-04T12:46:42.9748202Z Synchronizing submodule url for 'third_party/cutlass' 2025-12-04T12:46:42.9762720Z Synchronizing submodule url for 'third_party/fbgemm' 2025-12-04T12:46:42.9776439Z Synchronizing submodule url for 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:42.9791356Z Synchronizing submodule url for 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:42.9811255Z Synchronizing submodule url for 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:42.9823533Z Synchronizing submodule url for 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:42.9845749Z Synchronizing submodule url for 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:42.9861735Z Synchronizing submodule url for 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:42.9872233Z Synchronizing submodule url for 'third_party/fbgemm/external/json' 2025-12-04T12:46:42.9884774Z Synchronizing submodule url for 'third_party/flash-attention' 2025-12-04T12:46:42.9899255Z Synchronizing submodule url for 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:42.9915692Z Synchronizing submodule url for 
'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:42.9931024Z Synchronizing submodule url for 'third_party/flatbuffers' 2025-12-04T12:46:42.9942163Z Synchronizing submodule url for 'third_party/fmt' 2025-12-04T12:46:42.9951495Z Synchronizing submodule url for 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:42.9960459Z Synchronizing submodule url for 'third_party/gloo' 2025-12-04T12:46:42.9969961Z Synchronizing submodule url for 'third_party/googletest' 2025-12-04T12:46:42.9978902Z Synchronizing submodule url for 'third_party/ideep' 2025-12-04T12:46:42.9992595Z Synchronizing submodule url for 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:43.0010768Z Synchronizing submodule url for 'third_party/ittapi' 2025-12-04T12:46:43.0020700Z Synchronizing submodule url for 'third_party/kineto' 2025-12-04T12:46:43.0029938Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:43.0041673Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:43.0053969Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:43.0068163Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:43.0077655Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:43.0091951Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:43.0101248Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:43.0112995Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:43.0124288Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:43.0135214Z 
Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:43.0146182Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:43.0157260Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:43.0167535Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:43.0182174Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:43.0202353Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:43.0222367Z Synchronizing submodule url for 'third_party/kleidiai' 2025-12-04T12:46:43.0233112Z Synchronizing submodule url for 'third_party/mimalloc' 2025-12-04T12:46:43.0247358Z Synchronizing submodule url for 'third_party/nlohmann' 2025-12-04T12:46:43.0260376Z Synchronizing submodule url for 'third_party/onnx' 2025-12-04T12:46:43.0277885Z Synchronizing submodule url for 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:43.0300209Z Synchronizing submodule url for 'third_party/opentelemetry-cpp' 2025-12-04T12:46:43.0312609Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:43.0328191Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:43.0339414Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:43.0348255Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:43.0357097Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:43.0367266Z Synchronizing submodule url for 
'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:43.0382141Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:43.0396448Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:43.0411457Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:43.0423446Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:43.0452822Z Synchronizing submodule url for 'third_party/pocketfft' 2025-12-04T12:46:43.0462767Z Synchronizing submodule url for 'third_party/protobuf' 2025-12-04T12:46:43.0479733Z Synchronizing submodule url for 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:43.0498933Z Synchronizing submodule url for 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:43.0511390Z Synchronizing submodule url for 'third_party/psimd' 2025-12-04T12:46:43.0521224Z Synchronizing submodule url for 'third_party/pthreadpool' 2025-12-04T12:46:43.0531519Z Synchronizing submodule url for 'third_party/pybind11' 2025-12-04T12:46:43.0546682Z Synchronizing submodule url for 'third_party/python-peachpy' 2025-12-04T12:46:43.0559162Z Synchronizing submodule url for 'third_party/sleef' 2025-12-04T12:46:43.0569744Z Synchronizing submodule url for 'third_party/tensorpipe' 2025-12-04T12:46:43.0580861Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:43.0597457Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:43.0607357Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:43.0617429Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:43.0628629Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/pybind11/tools/clang' 
2025-12-04T12:46:43.0652007Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --recursive 2025-12-04T12:46:43.0949708Z Submodule path 'android/libs/fbjni': checked out '7e1e1fe3858c63c251c637ae41a20de425dde96f' 2025-12-04T12:46:43.1028456Z Submodule path 'third_party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3' 2025-12-04T12:46:43.1093869Z Submodule path 'third_party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1' 2025-12-04T12:46:43.1151815Z Submodule path 'third_party/NNPACK': checked out 'c07e3a0400713d546e0dea2d5466dd22ea389c73' 2025-12-04T12:46:43.1218075Z Submodule path 'third_party/NVTX': checked out '3ebbc93ded7285963bff932c678fa367eb393ba6' 2025-12-04T12:46:43.1271779Z Submodule path 'third_party/VulkanMemoryAllocator': checked out '1d8f600fd424278486eade7ed3e877c99f0846b1' 2025-12-04T12:46:43.1437821Z Submodule path 'third_party/XNNPACK': checked out '51a0103656eff6fc9bfd39a4597923c4b542c883' 2025-12-04T12:46:43.1583048Z Submodule path 'third_party/aiter': checked out '01aae101b9e5e94d6c16a9514c9fb8df99c93150' 2025-12-04T12:46:43.1768310Z Submodule path 'third_party/aiter/3rdparty/composable_kernel': checked out 'cffe8fa2a442ac8e80dd236a1a5d24fe3d7e0cbf' 2025-12-04T12:46:43.1834205Z Submodule path 'third_party/benchmark': checked out '299e5928955cc62af9968370293b916f5130916f' 2025-12-04T12:46:43.2011607Z Submodule path 'third_party/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T12:46:43.2083961Z Submodule path 'third_party/cpp-httplib': checked out '89c932f313c6437c38f2982869beacc89c2f2246' 2025-12-04T12:46:43.2147104Z Submodule path 'third_party/cpuinfo': checked out 'f858c30bcb16f8effd5ff46996f0514539e17abc' 2025-12-04T12:46:43.2217169Z Submodule path 'third_party/cudnn_frontend': checked out '0b1577c8c83401237d601d0d0db5210506705396' 2025-12-04T12:46:43.2343368Z Submodule path 'third_party/cutlass': checked out 'f88806b1e31dfa579842638740216dd41fc6c588' 
2025-12-04T12:46:43.2470949Z Submodule path 'third_party/fbgemm': checked out 'c0b988d39a9e47c794d699f29930ed4d7c7e13a4' 2025-12-04T12:46:43.2530188Z Submodule path 'third_party/fbgemm/external/asmjit': checked out 'a3199e8857792cd10b7589ff5d58343d2c9008ea' 2025-12-04T12:46:43.2708441Z Submodule path 'third_party/fbgemm/external/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T12:46:43.2788227Z Submodule path 'third_party/fbgemm/external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-12-04T12:46:43.2906191Z Submodule path 'third_party/fbgemm/external/cutlass': checked out '98125ce499b0fdf7ffbe0e3052f5b8709f4840f8' 2025-12-04T12:46:43.2976649Z Submodule path 'third_party/fbgemm/external/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T12:46:43.3027832Z Submodule path 'third_party/fbgemm/external/hipify_torch': checked out '63b6a7b541fa7f08f8475ca7d74054db36ff2691' 2025-12-04T12:46:43.3110451Z Submodule path 'third_party/fbgemm/external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-12-04T12:46:43.3192348Z Submodule path 'third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2025-12-04T12:46:43.3373473Z Submodule path 'third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2025-12-04T12:46:43.3491069Z Submodule path 'third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2025-12-04T12:46:43.3587605Z Submodule path 'third_party/flatbuffers': checked out 'a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757' 2025-12-04T12:46:43.3656516Z Submodule path 'third_party/fmt': checked out '407c905e45ad75fc29bf0f9bb7c5c2fd3475976f' 2025-12-04T12:46:43.3716097Z Submodule path 'third_party/gemmlowp/gemmlowp': checked out '3fb5c176c17c765a3492cd2f0321b0dab712f350' 2025-12-04T12:46:43.3783661Z Submodule path 'third_party/gloo': checked out 
'54cbae0d3a67fa890b4c3d9ee162b7860315e341' 2025-12-04T12:46:43.3845430Z Submodule path 'third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T12:46:43.3909577Z Submodule path 'third_party/ideep': checked out '719d8e6cd7f7a0e01b155657526d693acf97c2b3' 2025-12-04T12:46:43.4094933Z Submodule path 'third_party/ideep/mkl-dnn': checked out '8d263e693366ef8db40acc569cc7d8edf644556d' 2025-12-04T12:46:43.4150618Z Submodule path 'third_party/ittapi': checked out 'dec1d23ca65ab069d225dfe40dea14f455170959' 2025-12-04T12:46:43.4228691Z Submodule path 'third_party/kineto': checked out '31f85df8fbd89c188f14ef10f1ec65379786b943' 2025-12-04T12:46:43.4307195Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' 2025-12-04T12:46:43.4390682Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM': checked out 'ffde4e54bc7249a6039a5e6b45b395141e1217f9' 2025-12-04T12:46:43.4453357Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr': checked out '871ed52d350214a034f6ef8a3b8f51c5ce1bd400' 2025-12-04T12:46:43.4509167Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt': checked out 'cd4af11efc9c622896a3e4cb599fa28668ca3d05' 2025-12-04T12:46:43.4560358Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags': checked out 'e171aa2d15ed9eb17054558e0b3a6a413bb01067' 2025-12-04T12:46:43.4616815Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc': checked out '8411df715cf522606e3b1aca386ddfc0b63d34b4' 2025-12-04T12:46:43.4673459Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog': checked out 'b33e3bad4c46c8a6345525fd822af355e5ef9446' 2025-12-04T12:46:43.4741968Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 
2025-12-04T12:46:43.4828009Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json': checked out '4f8fba14066156b73f1189a2b8bd568bde5284c5' 2025-12-04T12:46:43.4897340Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs': checked out 'f68a2fa8ea36c783bdd760371411fcb495aa3150' 2025-12-04T12:46:43.4959191Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' 2025-12-04T12:46:43.5043455Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159' 2025-12-04T12:46:43.5116040Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T12:46:43.5188308Z Submodule path 'third_party/kineto/libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' 2025-12-04T12:46:43.5252005Z Submodule path 'third_party/kineto/libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T12:46:43.5350326Z Submodule path 'third_party/kleidiai': checked out 'd7770c89632329a9914ef1a90289917597639cbe' 2025-12-04T12:46:43.5421137Z Submodule path 'third_party/mimalloc': checked out 'fbd8b99c2b828428947d70fdc046bb55609be93e' 2025-12-04T12:46:43.5523244Z Submodule path 'third_party/nlohmann': checked out '55f93686c01528224f448c19128836e7df245f72' 2025-12-04T12:46:43.5681702Z Submodule path 'third_party/onnx': checked out 'e709452ef2bbc1d113faf678c24e6d3467696e83' 2025-12-04T12:46:43.5757874Z Submodule path 'third_party/onnx/third_party/pybind11': checked out 'a2e59f0e7065404b44dfe92a28aca47ba1378dc4' 2025-12-04T12:46:43.5867589Z Submodule path 'third_party/opentelemetry-cpp': checked out 'a799f4aed9c94b765dcdaabaeab7d5e7e2310878' 2025-12-04T12:46:43.5928507Z 
Submodule path 'third_party/opentelemetry-cpp/third_party/benchmark': checked out 'd572f4777349d43653b21d6c2fc63020ab326db2' 2025-12-04T12:46:43.5981570Z Submodule path 'third_party/opentelemetry-cpp/third_party/googletest': checked out 'b796f7d44681514f58a683a3a71ff17c94edb0c1' 2025-12-04T12:46:43.6040073Z Submodule path 'third_party/opentelemetry-cpp/third_party/ms-gsl': checked out '6f4529395c5b7c2d661812257cd6780c67e54afa' 2025-12-04T12:46:43.6126543Z Submodule path 'third_party/opentelemetry-cpp/third_party/nlohmann-json': checked out 'bc889afb4c5bf1c0d8ee29ef35eaaf4c8bef8a5d' 2025-12-04T12:46:43.6188275Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto': checked out '4ca4f0335c63cda7ab31ea7ed70d6553aee14dce' 2025-12-04T12:46:43.6237845Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp': checked out '06b57f48ded1fa3bdd3d4346f6ef29e40e08eaf5' 2025-12-04T12:46:43.6295094Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp': checked out 'c9ffcdda9086ffd9e1283ea7a0276d831f3c8a8d' 2025-12-04T12:46:43.6381330Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'eefb26f82b233268fc98577d265352720d477ba4' 2025-12-04T12:46:43.6436274Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T12:46:43.6576287Z Submodule path 'third_party/opentelemetry-cpp/tools/vcpkg': checked out '8eb57355a4ffb410a2e94c07b4dca2dffbee8e50' 2025-12-04T12:46:43.6636685Z Submodule path 'third_party/pocketfft': checked out '0fa0ef591e38c2758e3184c6c23e497b9f732ffa' 2025-12-04T12:46:43.6800764Z Submodule path 'third_party/protobuf': checked out 'd1eca4e4b421cd2997495c4b4e65cea6be4e9b8a' 2025-12-04T12:46:43.6872295Z Submodule path 'third_party/protobuf/third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8' 2025-12-04T12:46:43.6937877Z 
Submodule path 'third_party/protobuf/third_party/googletest': checked out '5ec7f0c4a113e2f18ac2c6cc7df51ad6afc24081' 2025-12-04T12:46:43.6991717Z Submodule path 'third_party/psimd': checked out '072586a71b55b7f8c584153d223e95687148a900' 2025-12-04T12:46:43.7041883Z Submodule path 'third_party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8' 2025-12-04T12:46:43.7137627Z Submodule path 'third_party/pybind11': checked out 'f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8' 2025-12-04T12:46:43.7194275Z Submodule path 'third_party/python-peachpy': checked out 'f45429b087dd7d5bc78bb40dc7cf06425c252d67' 2025-12-04T12:46:43.7255596Z Submodule path 'third_party/sleef': checked out '5a1d179df9cf652951b59010a2d2075372d67f68' 2025-12-04T12:46:43.7316629Z Submodule path 'third_party/tensorpipe': checked out '2b4cd91092d335a697416b2a3cb398283246849d' 2025-12-04T12:46:43.7397331Z Submodule path 'third_party/tensorpipe/third_party/googletest': checked out 'aee0f9d9b5b87796ee8a0ab26b7587ec30e8858e' 2025-12-04T12:46:43.7454523Z Submodule path 'third_party/tensorpipe/third_party/libnop': checked out '910b55815be16109f04f4180e9adee14fb4ce281' 2025-12-04T12:46:43.7587049Z Submodule path 'third_party/tensorpipe/third_party/libuv': checked out '5152db2cbfeb5582e9c27c5ea1dba2cd9e10759b' 2025-12-04T12:46:43.7665938Z Submodule path 'third_party/tensorpipe/third_party/pybind11': checked out 'a23996fce38ff6ccfbcdc09f1e63f2c4be5ea2ef' 2025-12-04T12:46:43.7715903Z Submodule path 'third_party/tensorpipe/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2025-12-04T12:46:43.7746743Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0 2025-12-04T12:46:43.7922455Z Entering 'android/libs/fbjni' 2025-12-04T12:46:43.7943536Z Entering 'third_party/FP16' 2025-12-04T12:46:43.7967356Z Entering 'third_party/FXdiv' 2025-12-04T12:46:43.7989187Z Entering 'third_party/NNPACK' 2025-12-04T12:46:43.8008538Z Entering 'third_party/NVTX' 
2025-12-04T12:46:43.8031356Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:43.8053687Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:43.8089956Z Entering 'third_party/aiter' 2025-12-04T12:46:43.8118541Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:43.8144601Z Entering 'third_party/benchmark' 2025-12-04T12:46:43.8171918Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:43.8198413Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:43.8259613Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:43.8283604Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:43.8307902Z Entering 'third_party/cutlass' 2025-12-04T12:46:43.8332030Z Entering 'third_party/fbgemm' 2025-12-04T12:46:43.8357145Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:43.8379427Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:43.8414932Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:43.8438249Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:43.8469857Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:43.8496045Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:43.8528972Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:43.8552955Z Entering 'third_party/flash-attention' 2025-12-04T12:46:43.8579643Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:43.8608558Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:43.8634123Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:43.8661352Z Entering 'third_party/fmt' 2025-12-04T12:46:43.8683871Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:43.8704620Z Entering 'third_party/gloo' 2025-12-04T12:46:43.8725702Z Entering 'third_party/googletest' 2025-12-04T12:46:43.8745478Z Entering 'third_party/ideep' 2025-12-04T12:46:43.8764526Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:43.8787364Z Entering 'third_party/ittapi' 
2025-12-04T12:46:43.8808631Z Entering 'third_party/kineto' 2025-12-04T12:46:43.8830943Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:43.8848839Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:43.8870671Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:43.8890015Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:43.8915138Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:43.8939394Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:43.8960615Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:43.8985132Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:43.9007830Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:43.9027805Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:43.9049238Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:43.9075857Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:43.9102258Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:43.9127233Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:43.9150967Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:43.9182086Z Entering 'third_party/kleidiai' 2025-12-04T12:46:43.9206446Z Entering 'third_party/mimalloc' 2025-12-04T12:46:43.9228237Z Entering 'third_party/nlohmann' 2025-12-04T12:46:43.9258471Z Entering 'third_party/onnx' 2025-12-04T12:46:43.9296789Z Entering 'third_party/onnx/third_party/pybind11' 
2025-12-04T12:46:43.9322694Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:43.9345583Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:43.9370514Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:43.9395168Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:43.9426805Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:43.9449572Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:43.9469047Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:43.9489058Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:43.9507674Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:43.9528424Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:43.9549482Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:43.9578551Z Entering 'third_party/pocketfft' 2025-12-04T12:46:43.9599255Z Entering 'third_party/protobuf' 2025-12-04T12:46:43.9621935Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:43.9641468Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:43.9673183Z Entering 'third_party/psimd' 2025-12-04T12:46:43.9694007Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:43.9714933Z Entering 'third_party/pybind11' 2025-12-04T12:46:43.9735950Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:43.9759858Z Entering 'third_party/sleef' 2025-12-04T12:46:43.9780760Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:43.9802227Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:43.9833513Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:43.9853484Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:43.9882838Z Entering 
'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:43.9910806Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:43.9951795Z ##[endgroup] 2025-12-04T12:46:43.9952001Z ##[group]Persisting credentials for submodules 2025-12-04T12:46:43.9958797Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-12-04T12:46:44.0156499Z Entering 'android/libs/fbjni' 2025-12-04T12:46:44.0173087Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0173238Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0190696Z Entering 'third_party/FP16' 2025-12-04T12:46:44.0206918Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0207082Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0222897Z Entering 'third_party/FXdiv' 2025-12-04T12:46:44.0236313Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0236457Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0259035Z Entering 'third_party/NNPACK' 2025-12-04T12:46:44.0276137Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0276297Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0292486Z Entering 'third_party/NVTX' 2025-12-04T12:46:44.0304613Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0304812Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0320504Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:44.0334846Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0334990Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0356153Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:44.0370474Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0370621Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0392105Z Entering 'third_party/aiter' 2025-12-04T12:46:44.0409539Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0409696Z 
url.https://github.com/.insteadof 2025-12-04T12:46:44.0426529Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:44.0441980Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0442135Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0463246Z Entering 'third_party/benchmark' 2025-12-04T12:46:44.0476381Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0476530Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0493257Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:44.0504955Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0505106Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0525019Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:44.0538615Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0538768Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0554129Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:44.0574918Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0575074Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0592789Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:44.0606439Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0606594Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0625151Z Entering 'third_party/cutlass' 2025-12-04T12:46:44.0638345Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0638493Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0659309Z Entering 'third_party/fbgemm' 2025-12-04T12:46:44.0685257Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0685444Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0710772Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:44.0729076Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0729238Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0756775Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:44.0778401Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0778719Z url.https://github.com/.insteadof 
2025-12-04T12:46:44.0810339Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:44.0833431Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0833783Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0859139Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:44.0883269Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0883480Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0910630Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:44.0935254Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0935416Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0960277Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:44.0975056Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0975367Z url.https://github.com/.insteadof 2025-12-04T12:46:44.0994165Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:44.1012102Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1012333Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1041879Z Entering 'third_party/flash-attention' 2025-12-04T12:46:44.1057978Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1058130Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1074224Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:44.1091136Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1091300Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1115195Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:44.1127888Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1128044Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1148332Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:44.1160309Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1160610Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1183202Z Entering 'third_party/fmt' 2025-12-04T12:46:44.1202641Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1203073Z 
url.https://github.com/.insteadof 2025-12-04T12:46:44.1220721Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:44.1234313Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1234458Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1252340Z Entering 'third_party/gloo' 2025-12-04T12:46:44.1266024Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1266195Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1288084Z Entering 'third_party/googletest' 2025-12-04T12:46:44.1300195Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1300375Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1325413Z Entering 'third_party/ideep' 2025-12-04T12:46:44.1348091Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1348262Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1369228Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:44.1384969Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1385133Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1409710Z Entering 'third_party/ittapi' 2025-12-04T12:46:44.1425167Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1425332Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1442158Z Entering 'third_party/kineto' 2025-12-04T12:46:44.1455760Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1455918Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1475168Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:44.1489079Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1489238Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1508694Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:44.1525142Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1525555Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1548599Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:44.1562179Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1562344Z 
url.https://github.com/.insteadof 2025-12-04T12:46:44.1579898Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:44.1591630Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1591933Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1607832Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:44.1618762Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1618921Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1635479Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:44.1649864Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1650009Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1671945Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:44.1683680Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1683834Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1706118Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:44.1718213Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1718373Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1736896Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:44.1755175Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1755329Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1771206Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:44.1785846Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1797669Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1802536Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:44.1815830Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1815972Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1840253Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:44.1860191Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1860692Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1883286Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:44.1899108Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1899254Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1923664Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:44.1938393Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1938553Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1960035Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:44.1975692Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1975851Z url.https://github.com/.insteadof 2025-12-04T12:46:44.1996663Z Entering 'third_party/kleidiai' 2025-12-04T12:46:44.2010679Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2010840Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2027087Z Entering 'third_party/mimalloc' 2025-12-04T12:46:44.2039867Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2040027Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2056077Z Entering 'third_party/nlohmann' 2025-12-04T12:46:44.2073841Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2074001Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2090975Z Entering 'third_party/onnx' 2025-12-04T12:46:44.2110058Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2110221Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2137847Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:44.2150361Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2150538Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2179497Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:44.2196393Z url.https://github.com/.insteadof 
2025-12-04T12:46:44.2196549Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2213740Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:44.2234346Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2234500Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2257055Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:44.2275880Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2276041Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2297458Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:44.2309794Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2309941Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2329497Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:44.2345021Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2345404Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2368376Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:44.2381310Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2381471Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2398795Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:44.2411572Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2411727Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2433106Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:44.2449582Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2449733Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2468596Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:44.2481982Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2482138Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2504260Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:44.2516706Z 
url.https://github.com/.insteadof 2025-12-04T12:46:44.2516855Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2540123Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:44.2552954Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2553108Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2576765Z Entering 'third_party/pocketfft' 2025-12-04T12:46:44.2592437Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2592595Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2607714Z Entering 'third_party/protobuf' 2025-12-04T12:46:44.2621282Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2621574Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2640874Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:44.2656804Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2656968Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2672232Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:44.2683924Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2684086Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2705203Z Entering 'third_party/psimd' 2025-12-04T12:46:44.2721185Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2721345Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2741015Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:44.2754699Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2754859Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2773133Z Entering 'third_party/pybind11' 2025-12-04T12:46:44.2785943Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2786104Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2801804Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:44.2815076Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2815230Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2840069Z Entering 'third_party/sleef' 2025-12-04T12:46:44.2853178Z url.https://github.com/.insteadof 
2025-12-04T12:46:44.2853333Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2871282Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:44.2884526Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2884678Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2901129Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:44.2913256Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2913422Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2930201Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:44.2943548Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2943705Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2973371Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:44.2986110Z url.https://github.com/.insteadof 2025-12-04T12:46:44.2986265Z url.https://github.com/.insteadof 2025-12-04T12:46:44.3003301Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:44.3016515Z url.https://github.com/.insteadof 2025-12-04T12:46:44.3016677Z url.https://github.com/.insteadof 2025-12-04T12:46:44.3032553Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:44.3044386Z url.https://github.com/.insteadof 2025-12-04T12:46:44.3044542Z url.https://github.com/.insteadof 2025-12-04T12:46:44.3086933Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-12-04T12:46:44.3248997Z Entering 'android/libs/fbjni' 2025-12-04T12:46:44.3268643Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T12:46:44.3279964Z Entering 'third_party/FP16' 2025-12-04T12:46:44.3299623Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T12:46:44.3311267Z Entering 'third_party/FXdiv' 
2025-12-04T12:46:44.3330046Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T12:46:44.3338603Z Entering 'third_party/NNPACK' 2025-12-04T12:46:44.3367689Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T12:46:44.3377660Z Entering 'third_party/NVTX' 2025-12-04T12:46:44.3397207Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T12:46:44.3406286Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:44.3427281Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T12:46:44.3436150Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:44.3454082Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T12:46:44.3468529Z Entering 'third_party/aiter' 2025-12-04T12:46:44.3487945Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T12:46:44.3498492Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:44.3521214Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T12:46:44.3536780Z Entering 'third_party/benchmark' 2025-12-04T12:46:44.3558773Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:44.3568109Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:44.3591092Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T12:46:44.3605116Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:44.3622977Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T12:46:44.3634407Z Entering 'third_party/cpuinfo' 
2025-12-04T12:46:44.3655254Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T12:46:44.3666502Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:44.3704094Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T12:46:44.3714925Z Entering 'third_party/cutlass' 2025-12-04T12:46:44.3734427Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T12:46:44.3749155Z Entering 'third_party/fbgemm' 2025-12-04T12:46:44.3767638Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T12:46:44.3779743Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:44.3812640Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T12:46:44.3822764Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:44.3844729Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T12:46:44.3857730Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:44.3884935Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T12:46:44.3894912Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:44.3923945Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T12:46:44.3940569Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:44.3963179Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T12:46:44.3972651Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:44.3996620Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T12:46:44.4005122Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:44.4024047Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T12:46:44.4035276Z Entering 'third_party/flash-attention' 2025-12-04T12:46:44.4061701Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T12:46:44.4071470Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:44.4099001Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T12:46:44.4111826Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:44.4143590Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T12:46:44.4159170Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:44.4179844Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T12:46:44.4193784Z Entering 'third_party/fmt' 2025-12-04T12:46:44.4211725Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T12:46:44.4226903Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:44.4255036Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T12:46:44.4264891Z Entering 'third_party/gloo' 2025-12-04T12:46:44.4284139Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T12:46:44.4293250Z Entering 'third_party/googletest' 2025-12-04T12:46:44.4311770Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 
2025-12-04T12:46:44.4320519Z Entering 'third_party/ideep' 2025-12-04T12:46:44.4342409Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T12:46:44.4355847Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:44.4383588Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T12:46:44.4396749Z Entering 'third_party/ittapi' 2025-12-04T12:46:44.4419755Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T12:46:44.4431645Z Entering 'third_party/kineto' 2025-12-04T12:46:44.4450759Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T12:46:44.4459448Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:44.4492796Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T12:46:44.4503017Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:44.4530386Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T12:46:44.4542261Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:44.4563722Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T12:46:44.4572388Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:44.4590748Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T12:46:44.4600229Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:44.4621098Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T12:46:44.4629487Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:44.4648071Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T12:46:44.4658123Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:44.4676051Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T12:46:44.4685195Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:44.4705621Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:44.4714127Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:44.4732836Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T12:46:44.4741602Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:44.4763840Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T12:46:44.4776619Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:44.4796366Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T12:46:44.4809225Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:44.4830137Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T12:46:44.4841456Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:44.4864876Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T12:46:44.4876922Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:44.4904199Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T12:46:44.4913316Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:44.4933515Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T12:46:44.4944038Z Entering 'third_party/kleidiai' 2025-12-04T12:46:44.4963107Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T12:46:44.4973346Z Entering 'third_party/mimalloc' 2025-12-04T12:46:44.4995953Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T12:46:44.5005382Z Entering 'third_party/nlohmann' 2025-12-04T12:46:44.5024095Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T12:46:44.5033867Z Entering 
'third_party/onnx' 2025-12-04T12:46:44.5050489Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T12:46:44.5067618Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:44.5091623Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:44.5104004Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:44.5125308Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T12:46:44.5134760Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:44.5153290Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:44.5162333Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:44.5182100Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:44.5189998Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:44.5211069Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T12:46:44.5220017Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:44.5244656Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T12:46:44.5258511Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:44.5284927Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T12:46:44.5294316Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 
2025-12-04T12:46:44.5317233Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T12:46:44.5325782Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:44.5342804Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T12:46:44.5350889Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:44.5376815Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T12:46:44.5390838Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:44.5413784Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T12:46:44.5426083Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:44.5446463Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T12:46:44.5463204Z Entering 'third_party/pocketfft' 2025-12-04T12:46:44.5488536Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T12:46:44.5497986Z Entering 'third_party/protobuf' 2025-12-04T12:46:44.5527796Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T12:46:44.5544968Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:44.5576023Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T12:46:44.5596814Z Entering 'third_party/protobuf/third_party/googletest' 
2025-12-04T12:46:44.5633386Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:44.5652239Z Entering 'third_party/psimd' 2025-12-04T12:46:44.5682399Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T12:46:44.5691791Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:44.5719341Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T12:46:44.5729619Z Entering 'third_party/pybind11' 2025-12-04T12:46:44.5753771Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:44.5764694Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:44.5787360Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T12:46:44.5801420Z Entering 'third_party/sleef' 2025-12-04T12:46:44.5823760Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T12:46:44.5836170Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:44.5860014Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T12:46:44.5867789Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:44.5900919Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T12:46:44.5911804Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:44.5931981Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T12:46:44.5940882Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:44.5968786Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T12:46:44.5980979Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:44.6004014Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T12:46:44.6018756Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:44.6039608Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T12:46:44.6313636Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-12-04T12:46:44.6500076Z Entering 'android/libs/fbjni' 2025-12-04T12:46:44.6534653Z Entering 'third_party/FP16' 2025-12-04T12:46:44.6567449Z Entering 'third_party/FXdiv' 2025-12-04T12:46:44.6596578Z Entering 'third_party/NNPACK' 2025-12-04T12:46:44.6617708Z Entering 'third_party/NVTX' 2025-12-04T12:46:44.6643194Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:44.6673401Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:44.6702476Z Entering 'third_party/aiter' 2025-12-04T12:46:44.6732089Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:44.6763006Z Entering 'third_party/benchmark' 2025-12-04T12:46:44.6786134Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:44.6812246Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:44.6838995Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:44.6860271Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:44.6885669Z Entering 'third_party/cutlass' 2025-12-04T12:46:44.6914918Z Entering 'third_party/fbgemm' 2025-12-04T12:46:44.6951219Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:44.6971879Z Entering 'third_party/fbgemm/external/composable_kernel' 
2025-12-04T12:46:44.7000996Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:44.7026322Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:44.7052996Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:44.7076048Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:44.7101684Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:44.7126330Z Entering 'third_party/flash-attention' 2025-12-04T12:46:44.7149721Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:44.7178515Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:44.7203656Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:44.7232014Z Entering 'third_party/fmt' 2025-12-04T12:46:44.7254099Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:44.7275203Z Entering 'third_party/gloo' 2025-12-04T12:46:44.7297069Z Entering 'third_party/googletest' 2025-12-04T12:46:44.7326781Z Entering 'third_party/ideep' 2025-12-04T12:46:44.7346548Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:44.7374479Z Entering 'third_party/ittapi' 2025-12-04T12:46:44.7395600Z Entering 'third_party/kineto' 2025-12-04T12:46:44.7419635Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T12:46:44.7439918Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:44.7460900Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:44.7482152Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:44.7501431Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:44.7520243Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:44.7551022Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:44.7573512Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:44.7602563Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:44.7624183Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:44.7643190Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:44.7663182Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:44.7683260Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:44.7712837Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:44.7736552Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:44.7769454Z Entering 'third_party/kleidiai' 2025-12-04T12:46:44.7790608Z Entering 'third_party/mimalloc' 2025-12-04T12:46:44.7817852Z Entering 'third_party/nlohmann' 2025-12-04T12:46:44.7841865Z Entering 'third_party/onnx' 2025-12-04T12:46:44.7870730Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:44.7907654Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:44.7934688Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:44.7963294Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:44.7990372Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:44.8015104Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:44.8034465Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:44.8062333Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:44.8083711Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:44.8104950Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:44.8129476Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:44.8154891Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:44.8195591Z Entering 'third_party/pocketfft' 2025-12-04T12:46:44.8217833Z Entering 'third_party/protobuf' 2025-12-04T12:46:44.8242079Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:44.8268710Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:44.8300717Z Entering 'third_party/psimd' 2025-12-04T12:46:44.8329185Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:44.8351532Z Entering 'third_party/pybind11' 2025-12-04T12:46:44.8387922Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:44.8414038Z Entering 'third_party/sleef' 2025-12-04T12:46:44.8435963Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:44.8460227Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:44.8487229Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:44.8508936Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:44.8528806Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:44.8551890Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:44.8599054Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2025-12-04T12:46:44.8766635Z Entering 'android/libs/fbjni' 2025-12-04T12:46:44.8797043Z Entering 'third_party/FP16' 2025-12-04T12:46:44.8817778Z Entering 'third_party/FXdiv' 2025-12-04T12:46:44.8838051Z Entering 'third_party/NNPACK' 2025-12-04T12:46:44.8858550Z Entering 'third_party/NVTX' 2025-12-04T12:46:44.8879534Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T12:46:44.8903196Z Entering 'third_party/XNNPACK' 2025-12-04T12:46:44.8928576Z Entering 
'third_party/aiter' 2025-12-04T12:46:44.8955470Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T12:46:44.8985811Z Entering 'third_party/benchmark' 2025-12-04T12:46:44.9007518Z Entering 'third_party/composable_kernel' 2025-12-04T12:46:44.9031872Z Entering 'third_party/cpp-httplib' 2025-12-04T12:46:44.9055598Z Entering 'third_party/cpuinfo' 2025-12-04T12:46:44.9075947Z Entering 'third_party/cudnn_frontend' 2025-12-04T12:46:44.9103821Z Entering 'third_party/cutlass' 2025-12-04T12:46:44.9129048Z Entering 'third_party/fbgemm' 2025-12-04T12:46:44.9153327Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T12:46:44.9184536Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T12:46:44.9208286Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T12:46:44.9231719Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T12:46:44.9265710Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T12:46:44.9293168Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T12:46:44.9317014Z Entering 'third_party/fbgemm/external/json' 2025-12-04T12:46:44.9345338Z Entering 'third_party/flash-attention' 2025-12-04T12:46:44.9367837Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T12:46:44.9399998Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T12:46:44.9425564Z Entering 'third_party/flatbuffers' 2025-12-04T12:46:44.9446396Z Entering 'third_party/fmt' 2025-12-04T12:46:44.9467540Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T12:46:44.9487904Z Entering 'third_party/gloo' 2025-12-04T12:46:44.9509095Z Entering 'third_party/googletest' 2025-12-04T12:46:44.9529814Z Entering 'third_party/ideep' 2025-12-04T12:46:44.9549783Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T12:46:44.9587119Z Entering 'third_party/ittapi' 2025-12-04T12:46:44.9614470Z Entering 'third_party/kineto' 2025-12-04T12:46:44.9643583Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 
2025-12-04T12:46:44.9670222Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T12:46:44.9691034Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T12:46:44.9710519Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T12:46:44.9728784Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T12:46:44.9757616Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T12:46:44.9781287Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T12:46:44.9804630Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T12:46:44.9826538Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T12:46:44.9849480Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T12:46:44.9871193Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T12:46:44.9892599Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:44.9917773Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:44.9952742Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T12:46:44.9977533Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T12:46:45.0001423Z Entering 'third_party/kleidiai' 2025-12-04T12:46:45.0022998Z Entering 'third_party/mimalloc' 2025-12-04T12:46:45.0045787Z Entering 'third_party/nlohmann' 2025-12-04T12:46:45.0067145Z Entering 'third_party/onnx' 2025-12-04T12:46:45.0096746Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T12:46:45.0119958Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T12:46:45.0148369Z Entering 
'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T12:46:45.0174681Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T12:46:45.0200834Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T12:46:45.0227377Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T12:46:45.0247141Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T12:46:45.0267646Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T12:46:45.0293406Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T12:46:45.0315315Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T12:46:45.0337228Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T12:46:45.0357988Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T12:46:45.0386249Z Entering 'third_party/pocketfft' 2025-12-04T12:46:45.0407619Z Entering 'third_party/protobuf' 2025-12-04T12:46:45.0427800Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T12:46:45.0446011Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T12:46:45.0471118Z Entering 'third_party/psimd' 2025-12-04T12:46:45.0492219Z Entering 'third_party/pthreadpool' 2025-12-04T12:46:45.0513367Z Entering 'third_party/pybind11' 2025-12-04T12:46:45.0536629Z Entering 'third_party/python-peachpy' 2025-12-04T12:46:45.0556121Z Entering 'third_party/sleef' 2025-12-04T12:46:45.0576536Z Entering 'third_party/tensorpipe' 2025-12-04T12:46:45.0597314Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T12:46:45.0621655Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T12:46:45.0649288Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T12:46:45.0670879Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T12:46:45.0692898Z Entering 
'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T12:46:45.0740918Z ##[endgroup] 2025-12-04T12:46:45.0965974Z [command]/usr/bin/git log -1 --format=%H 2025-12-04T12:46:45.1134996Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:46:45.1283076Z Prepare all required actions 2025-12-04T12:46:45.1283329Z Getting action download info 2025-12-04T12:46:45.3539072Z Download action repository 'aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076' (SHA:062b18b96a7aff071d4dc91bc00c4c1a7945b076) 2025-12-04T12:46:46.1625175Z ##[group]Run ./.github/actions/setup-rocm 2025-12-04T12:46:46.1625322Z env: 2025-12-04T12:46:46.1625410Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.1625511Z ##[endgroup] 2025-12-04T12:46:46.1638168Z ##[group]Run dpkg -l | grep -E " rocm" 2025-12-04T12:46:46.1638311Z dpkg -l | grep -E " rocm" 2025-12-04T12:46:46.1643131Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.1643269Z env: 2025-12-04T12:46:46.1643353Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.1643467Z ##[endgroup] 2025-12-04T12:46:46.1709520Z ii rocm-cmake 0.14.0.60401-83~22.04 amd64 rocm-cmake built using CMake 2025-12-04T12:46:46.1709959Z ii rocm-core 6.4.1.60401-83~22.04 amd64 ROCm Runtime software stack 2025-12-04T12:46:46.1710352Z ii rocm-dbgapi 0.77.2.60401-83~22.04 amd64 Library to provide AMD GPU debugger API 2025-12-04T12:46:46.1710788Z ii rocm-debug-agent 2.0.4.60401-83~22.04 amd64 Radeon Open Compute Debug Agent (ROCdebug-agent) 2025-12-04T12:46:46.1711228Z ii rocm-dev 6.4.1.60401-83~22.04 amd64 Radeon Open Compute (ROCm) Runtime software stack 2025-12-04T12:46:46.1711643Z ii rocm-device-libs 1.0.0.60401-83~22.04 amd64 Radeon Open Compute - device libraries 2025-12-04T12:46:46.1712009Z ii rocm-gdb 15.2.60401-83~22.04 amd64 ROCgdb 2025-12-04T12:46:46.1712717Z ii rocm-llvm 19.0.0.25184.60401-83~22.04 amd64 ROCm core compiler 2025-12-04T12:46:46.1713093Z ii rocm-opencl 2.0.0.60401-83~22.04 amd64 clr built using 
CMake 2025-12-04T12:46:46.1713474Z ii rocm-opencl-dev 2.0.0.60401-83~22.04 amd64 clr built using CMake 2025-12-04T12:46:46.1713852Z ii rocm-smi-lib 7.5.0.60401-83~22.04 amd64 AMD System Management libraries 2025-12-04T12:46:46.1714254Z ii rocm-utils 6.4.1.60401-83~22.04 amd64 Radeon Open Compute (ROCm) Runtime software stack 2025-12-04T12:46:46.1714669Z ii rocminfo 1.0.0.60401-83~22.04 amd64 Radeon Open Compute (ROCm) Runtime rocminfo tool 2025-12-04T12:46:46.1726733Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T12:46:46.1727108Z # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T12:46:46.1727301Z # shellcheck disable=SC2046 2025-12-04T12:46:46.1727451Z docker stop $(docker ps -q) || true 2025-12-04T12:46:46.1727667Z # Prune all stopped containers. 2025-12-04T12:46:46.1727816Z docker container prune -f 2025-12-04T12:46:46.1731070Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.1731235Z env: 2025-12-04T12:46:46.1731343Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.1731462Z ##[endgroup] 2025-12-04T12:46:46.1921351Z docker: 'docker stop' requires at least 1 argument 2025-12-04T12:46:46.1921608Z 2025-12-04T12:46:46.1921733Z Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...] 
2025-12-04T12:46:46.1921911Z 2025-12-04T12:46:46.1922028Z See 'docker stop --help' for more information 2025-12-04T12:46:46.2038472Z Total reclaimed space: 0B 2025-12-04T12:46:46.2067569Z ##[group]Run cat /etc/os-release || true 2025-12-04T12:46:46.2067783Z cat /etc/os-release || true 2025-12-04T12:46:46.2067926Z cat /etc/apt/sources.list.d/rocm.list || true 2025-12-04T12:46:46.2068240Z cat /opt/rocm/.info/version || true 2025-12-04T12:46:46.2068361Z whoami 2025-12-04T12:46:46.2073029Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.2073175Z env: 2025-12-04T12:46:46.2073270Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.2073372Z ##[endgroup] 2025-12-04T12:46:46.2096238Z PRETTY_NAME="Ubuntu 22.04.5 LTS" 2025-12-04T12:46:46.2096355Z NAME="Ubuntu" 2025-12-04T12:46:46.2096460Z VERSION_ID="22.04" 2025-12-04T12:46:46.2096565Z VERSION="22.04.5 LTS (Jammy Jellyfish)" 2025-12-04T12:46:46.2096690Z VERSION_CODENAME=jammy 2025-12-04T12:46:46.2096783Z ID=ubuntu 2025-12-04T12:46:46.2096867Z ID_LIKE=debian 2025-12-04T12:46:46.2096989Z HOME_URL="https://www.ubuntu.com/" 2025-12-04T12:46:46.2097134Z SUPPORT_URL="https://help.ubuntu.com/" 2025-12-04T12:46:46.2097293Z BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" 2025-12-04T12:46:46.2097547Z PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" 2025-12-04T12:46:46.2097739Z UBUNTU_CODENAME=jammy 2025-12-04T12:46:46.2101933Z deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.4.1 jammy main 2025-12-04T12:46:46.2106530Z 6.4.1-83 2025-12-04T12:46:46.2112986Z runner 2025-12-04T12:46:46.2130630Z ##[group]Run dpkg -l | grep -E " amdgpu" 2025-12-04T12:46:46.2130805Z dpkg -l | grep -E " amdgpu" 2025-12-04T12:46:46.2136143Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.2136301Z env: 2025-12-04T12:46:46.2136398Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.2136515Z ##[endgroup] 2025-12-04T12:46:46.2192652Z ii 
amdgpu-core 1:6.4.60401-2164967.22.04 all Core meta package for unified amdgpu driver. 2025-12-04T12:46:46.2192935Z ii amdgpu-install 6.4.60401-2164967.22.04 all AMDGPU driver repository and installer 2025-12-04T12:46:46.2211209Z ##[group]Run rocm-smi 2025-12-04T12:46:46.2211370Z rocm-smi 2025-12-04T12:46:46.2216159Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.2216311Z env: 2025-12-04T12:46:46.2216401Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.2216508Z ##[endgroup] 2025-12-04T12:46:46.2841797Z 2025-12-04T12:46:46.2841835Z 2025-12-04T12:46:46.2842050Z ============================================ ROCm System Management Interface ============================================ 2025-12-04T12:46:46.2842267Z ====================================================== Concise Info ====================================================== 2025-12-04T12:46:46.2842514Z Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU% 2025-12-04T12:46:46.2843071Z (DID, GUID) (Junction) (Socket) (Mem, Compute, ID) 2025-12-04T12:46:46.2843288Z ========================================================================================================================== 2025-12-04T12:46:46.2843772Z 0 3 0x74a5, 51110 28.0°C 122.0W NPS1, SPX, 0 N/A 900Mhz 0% manual 1000.0W 0% 0% 2025-12-04T12:46:46.2844035Z 1 5 0x74a5, 2987 28.0°C 121.0W NPS1, SPX, 0 N/A 900Mhz 0% manual 1000.0W 0% 0% 2025-12-04T12:46:46.2844292Z 2 4 0x74a5, 61326 28.0°C 117.0W NPS1, SPX, 0 N/A 900Mhz 0% manual 1000.0W 0% 0% 2025-12-04T12:46:46.2844553Z 3 2 0x74a5, 9091 28.0°C 113.0W NPS1, SPX, 0 N/A 900Mhz 0% manual 1000.0W 0% 0% 2025-12-04T12:46:46.2844914Z ========================================================================================================================== 2025-12-04T12:46:46.2845441Z ================================================== End of ROCm SMI Log =================================================== 2025-12-04T12:46:46.2901777Z ##[group]Run rocminfo
2025-12-04T12:46:46.2901918Z rocminfo 2025-12-04T12:46:46.2906470Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.2906624Z env: 2025-12-04T12:46:46.2906712Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.2906814Z ##[endgroup] 2025-12-04T12:46:46.3819120Z ROCk module version 6.12.12 is loaded 2025-12-04T12:46:46.3819265Z ===================== 2025-12-04T12:46:46.3819376Z HSA System Attributes 2025-12-04T12:46:46.3819482Z ===================== 2025-12-04T12:46:46.3819595Z Runtime Version: 1.15 2025-12-04T12:46:46.3819710Z Runtime Ext Version: 1.7 2025-12-04T12:46:46.3819835Z System Timestamp Freq.: 1000.000000MHz 2025-12-04T12:46:46.3820026Z Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) 2025-12-04T12:46:46.3820233Z Machine Model: LARGE 2025-12-04T12:46:46.3820420Z System Endianness: LITTLE 2025-12-04T12:46:46.3820566Z Mwaitx: DISABLED 2025-12-04T12:46:46.3820767Z XNACK enabled: NO 2025-12-04T12:46:46.3820884Z DMAbuf Support: YES 2025-12-04T12:46:46.3821046Z VMM Support: YES 2025-12-04T12:46:46.3821126Z 2025-12-04T12:46:46.3821297Z ========== 2025-12-04T12:46:46.3821477Z HSA Agents 2025-12-04T12:46:46.3821575Z ========== 2025-12-04T12:46:46.3821758Z ******* 2025-12-04T12:46:46.3821851Z Agent 1 2025-12-04T12:46:46.3821947Z ******* 2025-12-04T12:46:46.3822080Z Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:46:46.3822248Z Uuid: CPU-XX 2025-12-04T12:46:46.3822399Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:46:46.3822550Z Vendor Name: CPU 2025-12-04T12:46:46.3822848Z Feature: None specified 2025-12-04T12:46:46.3822997Z Profile: FULL_PROFILE 2025-12-04T12:46:46.3823144Z Float Round Mode: NEAR 2025-12-04T12:46:46.3823293Z Max Queue Number: 0(0x0) 2025-12-04T12:46:46.3823452Z Queue Min Size: 0(0x0) 2025-12-04T12:46:46.3823669Z Queue Max Size: 0(0x0) 2025-12-04T12:46:46.3823819Z Queue Type: MULTI 2025-12-04T12:46:46.3823963Z Node: 0 2025-12-04T12:46:46.3824104Z Device Type: CPU 
2025-12-04T12:46:46.3824239Z Cache Info: 2025-12-04T12:46:46.3824353Z L1: 49152(0xc000) KB 2025-12-04T12:46:46.3824494Z Chip ID: 0(0x0) 2025-12-04T12:46:46.3824637Z ASIC Revision: 0(0x0) 2025-12-04T12:46:46.3824784Z Cacheline Size: 64(0x40) 2025-12-04T12:46:46.3824933Z Max Clock Freq. (MHz): 3300 2025-12-04T12:46:46.3825074Z BDFID: 0 2025-12-04T12:46:46.3825213Z Internal Node ID: 0 2025-12-04T12:46:46.3825359Z Compute Unit: 64 2025-12-04T12:46:46.3825497Z SIMDs per CU: 0 2025-12-04T12:46:46.3825640Z Shader Engines: 0 2025-12-04T12:46:46.3825790Z Shader Arrs. per Eng.: 0 2025-12-04T12:46:46.3825965Z WatchPts on Addr. Ranges:1 2025-12-04T12:46:46.3826109Z Memory Properties: 2025-12-04T12:46:46.3826212Z Features: None 2025-12-04T12:46:46.3826316Z Pool Info: 2025-12-04T12:46:46.3826496Z Pool 1 2025-12-04T12:46:46.3826625Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:46:46.3826774Z Size: 1584734448(0x5e7520f0) KB 2025-12-04T12:46:46.3826917Z Allocatable: TRUE 2025-12-04T12:46:46.3827069Z Alloc Granule: 4KB 2025-12-04T12:46:46.3827227Z Alloc Recommended Granule:4KB 2025-12-04T12:46:46.3827383Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3827600Z Accessible by all: TRUE 2025-12-04T12:46:46.3827731Z Pool 2 2025-12-04T12:46:46.3827859Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:46:46.3828010Z Size: 1584734448(0x5e7520f0) KB 2025-12-04T12:46:46.3828151Z Allocatable: TRUE 2025-12-04T12:46:46.3828301Z Alloc Granule: 4KB 2025-12-04T12:46:46.3828457Z Alloc Recommended Granule:4KB 2025-12-04T12:46:46.3828611Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3828763Z Accessible by all: TRUE 2025-12-04T12:46:46.3828894Z Pool 3 2025-12-04T12:46:46.3829014Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2025-12-04T12:46:46.3829155Z Size: 1584734448(0x5e7520f0) KB 2025-12-04T12:46:46.3829292Z Allocatable: TRUE 2025-12-04T12:46:46.3829440Z Alloc Granule: 4KB 2025-12-04T12:46:46.3829639Z Alloc Recommended Granule:4KB 2025-12-04T12:46:46.3829795Z Alloc Alignment: 4KB 
2025-12-04T12:46:46.3829948Z Accessible by all: TRUE 2025-12-04T12:46:46.3830079Z Pool 4 2025-12-04T12:46:46.3830199Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:46:46.3830340Z Size: 1584734448(0x5e7520f0) KB 2025-12-04T12:46:46.3830478Z Allocatable: TRUE 2025-12-04T12:46:46.3830630Z Alloc Granule: 4KB 2025-12-04T12:46:46.3830785Z Alloc Recommended Granule:4KB 2025-12-04T12:46:46.3830940Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3831093Z Accessible by all: TRUE 2025-12-04T12:46:46.3831230Z ISA Info: 2025-12-04T12:46:46.3831325Z ******* 2025-12-04T12:46:46.3831417Z Agent 2 2025-12-04T12:46:46.3831509Z ******* 2025-12-04T12:46:46.3831620Z Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:46:46.3831758Z Uuid: CPU-XX 2025-12-04T12:46:46.3831903Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:46:46.3832055Z Vendor Name: CPU 2025-12-04T12:46:46.3832198Z Feature: None specified 2025-12-04T12:46:46.3832339Z Profile: FULL_PROFILE 2025-12-04T12:46:46.3832486Z Float Round Mode: NEAR 2025-12-04T12:46:46.3832631Z Max Queue Number: 0(0x0) 2025-12-04T12:46:46.3832779Z Queue Min Size: 0(0x0) 2025-12-04T12:46:46.3832922Z Queue Max Size: 0(0x0) 2025-12-04T12:46:46.3833106Z Queue Type: MULTI 2025-12-04T12:46:46.3833244Z Node: 1 2025-12-04T12:46:46.3833382Z Device Type: CPU 2025-12-04T12:46:46.3833507Z Cache Info: 2025-12-04T12:46:46.3833620Z L1: 49152(0xc000) KB 2025-12-04T12:46:46.3833751Z Chip ID: 0(0x0) 2025-12-04T12:46:46.3833891Z ASIC Revision: 0(0x0) 2025-12-04T12:46:46.3834037Z Cacheline Size: 64(0x40) 2025-12-04T12:46:46.3834181Z Max Clock Freq. (MHz): 3300 2025-12-04T12:46:46.3834321Z BDFID: 0 2025-12-04T12:46:46.3834464Z Internal Node ID: 1 2025-12-04T12:46:46.3834606Z Compute Unit: 64 2025-12-04T12:46:46.3834749Z SIMDs per CU: 0 2025-12-04T12:46:46.3834894Z Shader Engines: 0 2025-12-04T12:46:46.3835040Z Shader Arrs. per Eng.: 0 2025-12-04T12:46:46.3835193Z WatchPts on Addr. 
Ranges:1 2025-12-04T12:46:46.3835325Z Memory Properties: 2025-12-04T12:46:46.3835428Z Features: None 2025-12-04T12:46:46.3835531Z Pool Info: 2025-12-04T12:46:46.3835628Z Pool 1 2025-12-04T12:46:46.3835751Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:46:46.3835893Z Size: 1585355628(0x5e7e9b6c) KB 2025-12-04T12:46:46.3836067Z Allocatable: TRUE 2025-12-04T12:46:46.3836219Z Alloc Granule: 4KB 2025-12-04T12:46:46.3836376Z Alloc Recommended Granule:4KB 2025-12-04T12:46:46.3836533Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3836686Z Accessible by all: TRUE 2025-12-04T12:46:46.3836815Z Pool 2 2025-12-04T12:46:46.3836945Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:46:46.3837087Z Size: 1585355628(0x5e7e9b6c) KB 2025-12-04T12:46:46.3837226Z Allocatable: TRUE 2025-12-04T12:46:46.3837375Z Alloc Granule: 4KB 2025-12-04T12:46:46.3837563Z Alloc Recommended Granule:4KB 2025-12-04T12:46:46.3837727Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3837882Z Accessible by all: TRUE 2025-12-04T12:46:46.3838011Z Pool 3 2025-12-04T12:46:46.3838135Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2025-12-04T12:46:46.3838280Z Size: 1585355628(0x5e7e9b6c) KB 2025-12-04T12:46:46.3838418Z Allocatable: TRUE 2025-12-04T12:46:46.3838567Z Alloc Granule: 4KB 2025-12-04T12:46:46.3838719Z Alloc Recommended Granule:4KB 2025-12-04T12:46:46.3838875Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3839027Z Accessible by all: TRUE 2025-12-04T12:46:46.3839155Z Pool 4 2025-12-04T12:46:46.3839276Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:46:46.3839424Z Size: 1585355628(0x5e7e9b6c) KB 2025-12-04T12:46:46.3839597Z Allocatable: TRUE 2025-12-04T12:46:46.3839749Z Alloc Granule: 4KB 2025-12-04T12:46:46.3839904Z Alloc Recommended Granule:4KB 2025-12-04T12:46:46.3840058Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3840211Z Accessible by all: TRUE 2025-12-04T12:46:46.3840341Z ISA Info: 2025-12-04T12:46:46.3840442Z ******* 2025-12-04T12:46:46.3840535Z Agent 3 2025-12-04T12:46:46.3840625Z ******* 
2025-12-04T12:46:46.3840731Z Name: gfx942 2025-12-04T12:46:46.3840866Z Uuid: GPU-70f7a8c60f3fb761 2025-12-04T12:46:46.3841016Z Marketing Name: AMD Instinct MI325X 2025-12-04T12:46:46.3841167Z Vendor Name: AMD 2025-12-04T12:46:46.3841315Z Feature: KERNEL_DISPATCH 2025-12-04T12:46:46.3841460Z Profile: BASE_PROFILE 2025-12-04T12:46:46.3841607Z Float Round Mode: NEAR 2025-12-04T12:46:46.3841752Z Max Queue Number: 128(0x80) 2025-12-04T12:46:46.3841897Z Queue Min Size: 64(0x40) 2025-12-04T12:46:46.3842040Z Queue Max Size: 131072(0x20000) 2025-12-04T12:46:46.3842182Z Queue Type: MULTI 2025-12-04T12:46:46.3842318Z Node: 2 2025-12-04T12:46:46.3842451Z Device Type: GPU 2025-12-04T12:46:46.3842612Z Cache Info: 2025-12-04T12:46:46.3842723Z L1: 32(0x20) KB 2025-12-04T12:46:46.3842854Z L2: 4096(0x1000) KB 2025-12-04T12:46:46.3842978Z L3: 262144(0x40000) KB 2025-12-04T12:46:46.3843108Z Chip ID: 29861(0x74a5) 2025-12-04T12:46:46.3843247Z ASIC Revision: 1(0x1) 2025-12-04T12:46:46.3843394Z Cacheline Size: 128(0x80) 2025-12-04T12:46:46.3843540Z Max Clock Freq. (MHz): 2100 2025-12-04T12:46:46.3843679Z BDFID: 29952 2025-12-04T12:46:46.3843821Z Internal Node ID: 2 2025-12-04T12:46:46.3843963Z Compute Unit: 304 2025-12-04T12:46:46.3844104Z SIMDs per CU: 4 2025-12-04T12:46:46.3844252Z Shader Engines: 32 2025-12-04T12:46:46.3844400Z Shader Arrs. per Eng.: 1 2025-12-04T12:46:46.3844555Z WatchPts on Addr. 
Ranges:4 2025-12-04T12:46:46.3844707Z Coherent Host Access: FALSE 2025-12-04T12:46:46.3844841Z Memory Properties: 2025-12-04T12:46:46.3844956Z Features: KERNEL_DISPATCH 2025-12-04T12:46:46.3845092Z Fast F16 Operation: TRUE 2025-12-04T12:46:46.3845242Z Wavefront Size: 64(0x40) 2025-12-04T12:46:46.3845393Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3845528Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3845648Z x 1024(0x400) 2025-12-04T12:46:46.3845769Z y 1024(0x400) 2025-12-04T12:46:46.3845892Z z 1024(0x400) 2025-12-04T12:46:46.3846028Z Max Waves Per CU: 32(0x20) 2025-12-04T12:46:46.3846214Z Max Work-item Per CU: 2048(0x800) 2025-12-04T12:46:46.3846365Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3846499Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3846613Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3846740Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3846864Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3847004Z Max fbarriers/Workgrp: 32 2025-12-04T12:46:46.3852289Z Packet Processor uCode:: 185 2025-12-04T12:46:46.3852456Z SDMA engine uCode:: 24 2025-12-04T12:46:46.3852614Z IOMMU Support:: None 2025-12-04T12:46:46.3852748Z Pool Info: 2025-12-04T12:46:46.3852849Z Pool 1 2025-12-04T12:46:46.3852986Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:46:46.3853138Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3853283Z Allocatable: TRUE 2025-12-04T12:46:46.3853438Z Alloc Granule: 4KB 2025-12-04T12:46:46.3853598Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3853759Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3853916Z Accessible by all: FALSE 2025-12-04T12:46:46.3854050Z Pool 2 2025-12-04T12:46:46.3854177Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:46:46.3854406Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3854549Z Allocatable: TRUE 2025-12-04T12:46:46.3854704Z Alloc Granule: 4KB 2025-12-04T12:46:46.3854862Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3855018Z Alloc Alignment: 4KB 
2025-12-04T12:46:46.3855177Z Accessible by all: FALSE 2025-12-04T12:46:46.3855311Z Pool 3 2025-12-04T12:46:46.3855436Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:46:46.3855581Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3855724Z Allocatable: TRUE 2025-12-04T12:46:46.3855877Z Alloc Granule: 4KB 2025-12-04T12:46:46.3856039Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3856194Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3856349Z Accessible by all: FALSE 2025-12-04T12:46:46.3856481Z Pool 4 2025-12-04T12:46:46.3856601Z Segment: GROUP 2025-12-04T12:46:46.3856741Z Size: 64(0x40) KB 2025-12-04T12:46:46.3856881Z Allocatable: FALSE 2025-12-04T12:46:46.3857034Z Alloc Granule: 0KB 2025-12-04T12:46:46.3857192Z Alloc Recommended Granule:0KB 2025-12-04T12:46:46.3857348Z Alloc Alignment: 0KB 2025-12-04T12:46:46.3857552Z Accessible by all: FALSE 2025-12-04T12:46:46.3857689Z ISA Info: 2025-12-04T12:46:46.3857790Z ISA 1 2025-12-04T12:46:46.3857964Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T12:46:46.3858127Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:46:46.3858285Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:46:46.3858442Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3858602Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3872448Z Fast f16: TRUE 2025-12-04T12:46:46.3872691Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3872944Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3873097Z x 1024(0x400) 2025-12-04T12:46:46.3873235Z y 1024(0x400) 2025-12-04T12:46:46.3873419Z z 1024(0x400) 2025-12-04T12:46:46.3873601Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3873754Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3873904Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3874038Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3874235Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3874497Z FBarrier Max Size: 32 2025-12-04T12:46:46.3874635Z ISA 2 2025-12-04T12:46:46.3874788Z Name: 
amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2025-12-04T12:46:46.3875012Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:46:46.3875177Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:46:46.3875466Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3875634Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3875847Z Fast f16: TRUE 2025-12-04T12:46:46.3876109Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3876268Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3876413Z x 1024(0x400) 2025-12-04T12:46:46.3876546Z y 1024(0x400) 2025-12-04T12:46:46.3876674Z z 1024(0x400) 2025-12-04T12:46:46.3876816Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3876968Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3877096Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3877242Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3877377Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3877570Z FBarrier Max Size: 32 2025-12-04T12:46:46.3877710Z ******* 2025-12-04T12:46:46.3877809Z Agent 4 2025-12-04T12:46:46.3877912Z ******* 2025-12-04T12:46:46.3878024Z Name: gfx942 2025-12-04T12:46:46.3878171Z Uuid: GPU-e6f4b2936c68b43f 2025-12-04T12:46:46.3878329Z Marketing Name: AMD Instinct MI325X 2025-12-04T12:46:46.3878486Z Vendor Name: AMD 2025-12-04T12:46:46.3878642Z Feature: KERNEL_DISPATCH 2025-12-04T12:46:46.3878802Z Profile: BASE_PROFILE 2025-12-04T12:46:46.3878956Z Float Round Mode: NEAR 2025-12-04T12:46:46.3879112Z Max Queue Number: 128(0x80) 2025-12-04T12:46:46.3879305Z Queue Min Size: 64(0x40) 2025-12-04T12:46:46.3879460Z Queue Max Size: 131072(0x20000) 2025-12-04T12:46:46.3879613Z Queue Type: MULTI 2025-12-04T12:46:46.3879757Z Node: 3 2025-12-04T12:46:46.3879904Z Device Type: GPU 2025-12-04T12:46:46.3880046Z Cache Info: 2025-12-04T12:46:46.3880167Z L1: 32(0x20) KB 2025-12-04T12:46:46.3880312Z L2: 4096(0x1000) KB 2025-12-04T12:46:46.3880447Z L3: 262144(0x40000) KB 2025-12-04T12:46:46.3880580Z Chip ID: 29861(0x74a5) 2025-12-04T12:46:46.3880739Z ASIC Revision: 1(0x1) 
2025-12-04T12:46:46.3880897Z Cacheline Size: 128(0x80) 2025-12-04T12:46:46.3881057Z Max Clock Freq. (MHz): 2100 2025-12-04T12:46:46.3881206Z BDFID: 1280 2025-12-04T12:46:46.3881350Z Internal Node ID: 3 2025-12-04T12:46:46.3881659Z Compute Unit: 304 2025-12-04T12:46:46.3881817Z SIMDs per CU: 4 2025-12-04T12:46:46.3881970Z Shader Engines: 32 2025-12-04T12:46:46.3882135Z Shader Arrs. per Eng.: 1 2025-12-04T12:46:46.3882298Z WatchPts on Addr. Ranges:4 2025-12-04T12:46:46.3882467Z Coherent Host Access: FALSE 2025-12-04T12:46:46.3882652Z Memory Properties: 2025-12-04T12:46:46.3882768Z Features: KERNEL_DISPATCH 2025-12-04T12:46:46.3882917Z Fast F16 Operation: TRUE 2025-12-04T12:46:46.3883075Z Wavefront Size: 64(0x40) 2025-12-04T12:46:46.3883230Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3883372Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3883498Z x 1024(0x400) 2025-12-04T12:46:46.3883635Z y 1024(0x400) 2025-12-04T12:46:46.3883774Z z 1024(0x400) 2025-12-04T12:46:46.3883914Z Max Waves Per CU: 32(0x20) 2025-12-04T12:46:46.3884077Z Max Work-item Per CU: 2048(0x800) 2025-12-04T12:46:46.3884241Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3884383Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3884511Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3884650Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3884781Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3884938Z Max fbarriers/Workgrp: 32 2025-12-04T12:46:46.3885113Z Packet Processor uCode:: 185 2025-12-04T12:46:46.3885276Z SDMA engine uCode:: 24 2025-12-04T12:46:46.3885433Z IOMMU Support:: None 2025-12-04T12:46:46.3885567Z Pool Info: 2025-12-04T12:46:46.3885678Z Pool 1 2025-12-04T12:46:46.3885813Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:46:46.3885965Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3886121Z Allocatable: TRUE 2025-12-04T12:46:46.3886275Z Alloc Granule: 4KB 2025-12-04T12:46:46.3886476Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3886637Z Alloc Alignment: 
4KB 2025-12-04T12:46:46.3886790Z Accessible by all: FALSE 2025-12-04T12:46:46.3886926Z Pool 2 2025-12-04T12:46:46.3887061Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:46:46.3887207Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3887352Z Allocatable: TRUE 2025-12-04T12:46:46.3887542Z Alloc Granule: 4KB 2025-12-04T12:46:46.3887697Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3887861Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3888012Z Accessible by all: FALSE 2025-12-04T12:46:46.3888160Z Pool 3 2025-12-04T12:46:46.3888289Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:46:46.3888428Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3888572Z Allocatable: TRUE 2025-12-04T12:46:46.3888724Z Alloc Granule: 4KB 2025-12-04T12:46:46.3888880Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3889043Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3889194Z Accessible by all: FALSE 2025-12-04T12:46:46.3889333Z Pool 4 2025-12-04T12:46:46.3889495Z Segment: GROUP 2025-12-04T12:46:46.3889631Z Size: 64(0x40) KB 2025-12-04T12:46:46.3889778Z Allocatable: FALSE 2025-12-04T12:46:46.3889928Z Alloc Granule: 0KB 2025-12-04T12:46:46.3890085Z Alloc Recommended Granule:0KB 2025-12-04T12:46:46.3890247Z Alloc Alignment: 0KB 2025-12-04T12:46:46.3890404Z Accessible by all: FALSE 2025-12-04T12:46:46.3890537Z ISA Info: 2025-12-04T12:46:46.3890642Z ISA 1 2025-12-04T12:46:46.3890768Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T12:46:46.3890933Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:46:46.3891092Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:46:46.3891254Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3891418Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3891571Z Fast f16: TRUE 2025-12-04T12:46:46.3891720Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3891864Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3891990Z x 1024(0x400) 2025-12-04T12:46:46.3892121Z y 1024(0x400) 2025-12-04T12:46:46.3892251Z z 1024(0x400) 
2025-12-04T12:46:46.3892389Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3892526Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3892648Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3892773Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3892939Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3893087Z FBarrier Max Size: 32 2025-12-04T12:46:46.3893220Z ISA 2 2025-12-04T12:46:46.3893356Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2025-12-04T12:46:46.3893524Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:46:46.3893681Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:46:46.3893838Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3893994Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3894142Z Fast f16: TRUE 2025-12-04T12:46:46.3894288Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3894436Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3894562Z x 1024(0x400) 2025-12-04T12:46:46.3894691Z y 1024(0x400) 2025-12-04T12:46:46.3894817Z z 1024(0x400) 2025-12-04T12:46:46.3894958Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3895089Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3895209Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3895334Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3895462Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3895604Z FBarrier Max Size: 32 2025-12-04T12:46:46.3895734Z ******* 2025-12-04T12:46:46.3895860Z Agent 5 2025-12-04T12:46:46.3895960Z ******* 2025-12-04T12:46:46.3896069Z Name: gfx942 2025-12-04T12:46:46.3896218Z Uuid: GPU-21bc7b5a7907f984 2025-12-04T12:46:46.3896370Z Marketing Name: AMD Instinct MI325X 2025-12-04T12:46:46.3896524Z Vendor Name: AMD 2025-12-04T12:46:46.3896673Z Feature: KERNEL_DISPATCH 2025-12-04T12:46:46.3896819Z Profile: BASE_PROFILE 2025-12-04T12:46:46.3896971Z Float Round Mode: NEAR 2025-12-04T12:46:46.3897122Z Max Queue Number: 128(0x80) 2025-12-04T12:46:46.3897268Z Queue Min Size: 64(0x40) 2025-12-04T12:46:46.3897413Z Queue Max Size: 131072(0x20000) 
2025-12-04T12:46:46.3897618Z Queue Type: MULTI 2025-12-04T12:46:46.3897752Z Node: 4 2025-12-04T12:46:46.3897892Z Device Type: GPU 2025-12-04T12:46:46.3898019Z Cache Info: 2025-12-04T12:46:46.3898132Z L1: 32(0x20) KB 2025-12-04T12:46:46.3898258Z L2: 4096(0x1000) KB 2025-12-04T12:46:46.3898382Z L3: 262144(0x40000) KB 2025-12-04T12:46:46.3898510Z Chip ID: 29861(0x74a5) 2025-12-04T12:46:46.3898648Z ASIC Revision: 1(0x1) 2025-12-04T12:46:46.3898792Z Cacheline Size: 128(0x80) 2025-12-04T12:46:46.3898940Z Max Clock Freq. (MHz): 2100 2025-12-04T12:46:46.3899076Z BDFID: 25856 2025-12-04T12:46:46.3899227Z Internal Node ID: 4 2025-12-04T12:46:46.3899413Z Compute Unit: 304 2025-12-04T12:46:46.3899555Z SIMDs per CU: 4 2025-12-04T12:46:46.3899701Z Shader Engines: 32 2025-12-04T12:46:46.3899853Z Shader Arrs. per Eng.: 1 2025-12-04T12:46:46.3900006Z WatchPts on Addr. Ranges:4 2025-12-04T12:46:46.3900159Z Coherent Host Access: FALSE 2025-12-04T12:46:46.3900292Z Memory Properties: 2025-12-04T12:46:46.3900402Z Features: KERNEL_DISPATCH 2025-12-04T12:46:46.3900537Z Fast F16 Operation: TRUE 2025-12-04T12:46:46.3900683Z Wavefront Size: 64(0x40) 2025-12-04T12:46:46.3900833Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3900971Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3901091Z x 1024(0x400) 2025-12-04T12:46:46.3901214Z y 1024(0x400) 2025-12-04T12:46:46.3901331Z z 1024(0x400) 2025-12-04T12:46:46.3901467Z Max Waves Per CU: 32(0x20) 2025-12-04T12:46:46.3901615Z Max Work-item Per CU: 2048(0x800) 2025-12-04T12:46:46.3901759Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3901888Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3901999Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3902120Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3902244Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3902421Z Max fbarriers/Workgrp: 32 2025-12-04T12:46:46.3902578Z Packet Processor uCode:: 185 2025-12-04T12:46:46.3902736Z SDMA engine uCode:: 24 2025-12-04T12:46:46.3902885Z IOMMU 
Support:: None 2025-12-04T12:46:46.3903013Z Pool Info: 2025-12-04T12:46:46.3903113Z Pool 1 2025-12-04T12:46:46.3903236Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:46:46.3903382Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3903526Z Allocatable: TRUE 2025-12-04T12:46:46.3903675Z Alloc Granule: 4KB 2025-12-04T12:46:46.3903833Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3903989Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3904145Z Accessible by all: FALSE 2025-12-04T12:46:46.3904277Z Pool 2 2025-12-04T12:46:46.3904401Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:46:46.3904547Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3904687Z Allocatable: TRUE 2025-12-04T12:46:46.3904834Z Alloc Granule: 4KB 2025-12-04T12:46:46.3904989Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3905144Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3905293Z Accessible by all: FALSE 2025-12-04T12:46:46.3905424Z Pool 3 2025-12-04T12:46:46.3905543Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:46:46.3905684Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3905824Z Allocatable: TRUE 2025-12-04T12:46:46.3905998Z Alloc Granule: 4KB 2025-12-04T12:46:46.3906153Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3906307Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3906457Z Accessible by all: FALSE 2025-12-04T12:46:46.3906589Z Pool 4 2025-12-04T12:46:46.3906705Z Segment: GROUP 2025-12-04T12:46:46.3906841Z Size: 64(0x40) KB 2025-12-04T12:46:46.3906979Z Allocatable: FALSE 2025-12-04T12:46:46.3907124Z Alloc Granule: 0KB 2025-12-04T12:46:46.3907281Z Alloc Recommended Granule:0KB 2025-12-04T12:46:46.3907435Z Alloc Alignment: 0KB 2025-12-04T12:46:46.3907628Z Accessible by all: FALSE 2025-12-04T12:46:46.3907759Z ISA Info: 2025-12-04T12:46:46.3907856Z ISA 1 2025-12-04T12:46:46.3907979Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T12:46:46.3908141Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:46:46.3908293Z Profiles: HSA_PROFILE_BASE 
2025-12-04T12:46:46.3908448Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3908605Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3908750Z Fast f16: TRUE 2025-12-04T12:46:46.3908895Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3909067Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3909190Z x 1024(0x400) 2025-12-04T12:46:46.3909317Z y 1024(0x400) 2025-12-04T12:46:46.3909437Z z 1024(0x400) 2025-12-04T12:46:46.3909570Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3909703Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3909817Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3909943Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3910066Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3910204Z FBarrier Max Size: 32 2025-12-04T12:46:46.3910333Z ISA 2 2025-12-04T12:46:46.3910466Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2025-12-04T12:46:46.3910632Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:46:46.3910788Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:46:46.3910940Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3911096Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3911242Z Fast f16: TRUE 2025-12-04T12:46:46.3911385Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3911523Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3911643Z x 1024(0x400) 2025-12-04T12:46:46.3911762Z y 1024(0x400) 2025-12-04T12:46:46.3911884Z z 1024(0x400) 2025-12-04T12:46:46.3912020Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3912154Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3912308Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3912432Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3912557Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3912694Z FBarrier Max Size: 32 2025-12-04T12:46:46.3912820Z ******* 2025-12-04T12:46:46.3912913Z Agent 6 2025-12-04T12:46:46.3913004Z ******* 2025-12-04T12:46:46.3913110Z Name: gfx942 2025-12-04T12:46:46.3913248Z Uuid: GPU-992528b4a4dce35a 2025-12-04T12:46:46.3913394Z Marketing Name: 
AMD Instinct MI325X 2025-12-04T12:46:46.3913545Z Vendor Name: AMD 2025-12-04T12:46:46.3913770Z Feature: KERNEL_DISPATCH 2025-12-04T12:46:46.3913917Z Profile: BASE_PROFILE 2025-12-04T12:46:46.3914062Z Float Round Mode: NEAR 2025-12-04T12:46:46.3914207Z Max Queue Number: 128(0x80) 2025-12-04T12:46:46.3914351Z Queue Min Size: 64(0x40) 2025-12-04T12:46:46.3914492Z Queue Max Size: 131072(0x20000) 2025-12-04T12:46:46.3914632Z Queue Type: MULTI 2025-12-04T12:46:46.3914766Z Node: 5 2025-12-04T12:46:46.3914902Z Device Type: GPU 2025-12-04T12:46:46.3915025Z Cache Info: 2025-12-04T12:46:46.3915134Z L1: 32(0x20) KB 2025-12-04T12:46:46.3915288Z L2: 4096(0x1000) KB 2025-12-04T12:46:46.3915415Z L3: 262144(0x40000) KB 2025-12-04T12:46:46.3915543Z Chip ID: 29861(0x74a5) 2025-12-04T12:46:46.3915680Z ASIC Revision: 1(0x1) 2025-12-04T12:46:46.3915824Z Cacheline Size: 128(0x80) 2025-12-04T12:46:46.3915972Z Max Clock Freq. (MHz): 2100 2025-12-04T12:46:46.3916108Z BDFID: 5376 2025-12-04T12:46:46.3916246Z Internal Node ID: 5 2025-12-04T12:46:46.3916391Z Compute Unit: 304 2025-12-04T12:46:46.3916529Z SIMDs per CU: 4 2025-12-04T12:46:46.3916673Z Shader Engines: 32 2025-12-04T12:46:46.3916826Z Shader Arrs. per Eng.: 1 2025-12-04T12:46:46.3916979Z WatchPts on Addr. 
Ranges:4 2025-12-04T12:46:46.3917132Z Coherent Host Access: FALSE 2025-12-04T12:46:46.3917263Z Memory Properties: 2025-12-04T12:46:46.3917374Z Features: KERNEL_DISPATCH 2025-12-04T12:46:46.3917542Z Fast F16 Operation: TRUE 2025-12-04T12:46:46.3917690Z Wavefront Size: 64(0x40) 2025-12-04T12:46:46.3917838Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3917975Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3918092Z x 1024(0x400) 2025-12-04T12:46:46.3918213Z y 1024(0x400) 2025-12-04T12:46:46.3918332Z z 1024(0x400) 2025-12-04T12:46:46.3918470Z Max Waves Per CU: 32(0x20) 2025-12-04T12:46:46.3918649Z Max Work-item Per CU: 2048(0x800) 2025-12-04T12:46:46.3918794Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3918923Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3919051Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3919173Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3919299Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3919438Z Max fbarriers/Workgrp: 32 2025-12-04T12:46:46.3919596Z Packet Processor uCode:: 185 2025-12-04T12:46:46.3919749Z SDMA engine uCode:: 24 2025-12-04T12:46:46.3919896Z IOMMU Support:: None 2025-12-04T12:46:46.3920024Z Pool Info: 2025-12-04T12:46:46.3920121Z Pool 1 2025-12-04T12:46:46.3920246Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:46:46.3920394Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3920538Z Allocatable: TRUE 2025-12-04T12:46:46.3920686Z Alloc Granule: 4KB 2025-12-04T12:46:46.3920842Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3921000Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3921150Z Accessible by all: FALSE 2025-12-04T12:46:46.3921280Z Pool 2 2025-12-04T12:46:46.3921403Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:46:46.3921545Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3921718Z Allocatable: TRUE 2025-12-04T12:46:46.3921866Z Alloc Granule: 4KB 2025-12-04T12:46:46.3922024Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3922180Z Alloc Alignment: 4KB 
2025-12-04T12:46:46.3922330Z Accessible by all: FALSE 2025-12-04T12:46:46.3922468Z Pool 3 2025-12-04T12:46:46.3922588Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:46:46.3922727Z Size: 268419072(0xfffc000) KB 2025-12-04T12:46:46.3922866Z Allocatable: TRUE 2025-12-04T12:46:46.3923012Z Alloc Granule: 4KB 2025-12-04T12:46:46.3923166Z Alloc Recommended Granule:2048KB 2025-12-04T12:46:46.3923325Z Alloc Alignment: 4KB 2025-12-04T12:46:46.3923475Z Accessible by all: FALSE 2025-12-04T12:46:46.3923609Z Pool 4 2025-12-04T12:46:46.3923725Z Segment: GROUP 2025-12-04T12:46:46.3923858Z Size: 64(0x40) KB 2025-12-04T12:46:46.3923997Z Allocatable: FALSE 2025-12-04T12:46:46.3924144Z Alloc Granule: 0KB 2025-12-04T12:46:46.3924298Z Alloc Recommended Granule:0KB 2025-12-04T12:46:46.3924452Z Alloc Alignment: 0KB 2025-12-04T12:46:46.3924600Z Accessible by all: FALSE 2025-12-04T12:46:46.3924736Z ISA Info: 2025-12-04T12:46:46.3924834Z ISA 1 2025-12-04T12:46:46.3924959Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T12:46:46.3925142Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:46:46.3925294Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:46:46.3925448Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3925603Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3925746Z Fast f16: TRUE 2025-12-04T12:46:46.3925891Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3926029Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3926149Z x 1024(0x400) 2025-12-04T12:46:46.3926272Z y 1024(0x400) 2025-12-04T12:46:46.3926394Z z 1024(0x400) 2025-12-04T12:46:46.3926528Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3926662Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3926780Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3926906Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3927030Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3927166Z FBarrier Max Size: 32 2025-12-04T12:46:46.3927295Z ISA 2 2025-12-04T12:46:46.3927425Z Name: 
amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2025-12-04T12:46:46.3927631Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:46:46.3927784Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:46:46.3927934Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3928126Z Default Rounding Mode: NEAR 2025-12-04T12:46:46.3928279Z Fast f16: TRUE 2025-12-04T12:46:46.3928421Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:46:46.3928557Z Workgroup Max Size per Dimension: 2025-12-04T12:46:46.3928691Z x 1024(0x400) 2025-12-04T12:46:46.3928817Z y 1024(0x400) 2025-12-04T12:46:46.3928949Z z 1024(0x400) 2025-12-04T12:46:46.3929093Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:46:46.3929234Z Grid Max Size per Dimension: 2025-12-04T12:46:46.3929361Z x 4294967295(0xffffffff) 2025-12-04T12:46:46.3929492Z y 4294967295(0xffffffff) 2025-12-04T12:46:46.3929631Z z 4294967295(0xffffffff) 2025-12-04T12:46:46.3929782Z FBarrier Max Size: 32 2025-12-04T12:46:46.3929918Z *** Done *** 2025-12-04T12:46:46.3949819Z ##[group]Run ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') 2025-12-04T12:46:46.3950086Z ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') 2025-12-04T12:46:46.3950475Z msg="Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" 2025-12-04T12:46:46.3950854Z if [[ $ngpu -eq 0 ]]; then 2025-12-04T12:46:46.3951064Z  echo "Error: Failed to detect any GPUs on the runner" 2025-12-04T12:46:46.3951269Z  echo "$msg" 2025-12-04T12:46:46.3951421Z  exit 1 2025-12-04T12:46:46.3951550Z fi 2025-12-04T12:46:46.3955628Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.3955852Z env: 2025-12-04T12:46:46.3955985Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.3956134Z ##[endgroup] 2025-12-04T12:46:46.4998354Z ##[group]Run pytorch/pytorch/.github/actions/diskspace-cleanup@main 2025-12-04T12:46:46.4998534Z with: 2025-12-04T12:46:46.4998635Z diskspace-cutoff: 70 2025-12-04T12:46:46.4998740Z env: 2025-12-04T12:46:46.4998836Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.4998941Z ##[endgroup] 2025-12-04T12:46:46.5019015Z ##[group]Run set -ex 2025-12-04T12:46:46.5019152Z set -ex 2025-12-04T12:46:46.5019252Z diskspace_cutoff=70 2025-12-04T12:46:46.5019398Z docker_root_dir=$(docker info -f '{{.DockerRootDir}}') 2025-12-04T12:46:46.5019558Z if [ ! -d "$docker_root_dir" ]; then 2025-12-04T12:46:46.5019757Z  echo "Docker root directory ($docker_root_dir) does not exist. Skipping disk space check." 2025-12-04T12:46:46.5019945Z  exit 0 2025-12-04T12:46:46.5020034Z fi 2025-12-04T12:46:46.5020214Z diskspace=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //') 2025-12-04T12:46:46.5020541Z msg="Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" 2025-12-04T12:46:46.5020821Z if [[ "$diskspace" -ge "$diskspace_cutoff" ]] ; then 2025-12-04T12:46:46.5020967Z  docker system prune -af 2025-12-04T12:46:46.5021158Z  diskspace_new=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //') 2025-12-04T12:46:46.5021371Z  if [[ "$diskspace_new" -gt "$diskspace_cutoff" ]] ; then 2025-12-04T12:46:46.5021535Z  diskspace_cutoff_int=$((diskspace_cutoff + 0)) 2025-12-04T12:46:46.5021686Z  difference=$((100 - diskspace_cutoff_int)) 2025-12-04T12:46:46.5021895Z  echo "Error: Available diskspace is less than $difference percent. Not enough diskspace." 2025-12-04T12:46:46.5022201Z  echo "$msg" 2025-12-04T12:46:46.5022303Z  exit 1 2025-12-04T12:46:46.5022401Z  else 2025-12-04T12:46:46.5022518Z  difference=$((diskspace - diskspace_new)) 2025-12-04T12:46:46.5022668Z  echo "Diskspace saved: $difference percent" 2025-12-04T12:46:46.5022799Z  fi 2025-12-04T12:46:46.5022883Z fi 2025-12-04T12:46:46.5027645Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.5027789Z env: 2025-12-04T12:46:46.5027875Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.5027979Z ##[endgroup] 2025-12-04T12:46:46.5044758Z + diskspace_cutoff=70 2025-12-04T12:46:46.5048441Z ++ docker info -f '{{.DockerRootDir}}' 2025-12-04T12:46:46.5351135Z + docker_root_dir=/home/runner/docker-data 2025-12-04T12:46:46.5351417Z + '[' '!' -d /home/runner/docker-data ']' 2025-12-04T12:46:46.5359162Z ++ df -H --output=pcent /home/runner/docker-data 2025-12-04T12:46:46.5360046Z ++ sed -n 2p 2025-12-04T12:46:46.5361485Z ++ sed s/%// 2025-12-04T12:46:46.5361883Z ++ sed 's/ //' 2025-12-04T12:46:46.5375236Z + diskspace=' 3' 2025-12-04T12:46:46.5375839Z + msg='Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified' 2025-12-04T12:46:46.5376221Z + [[ 3 -ge 70 ]] 2025-12-04T12:46:46.5402223Z ##[group]Run RUNNER_ARTIFACT_DIR="${RUNNER_TEMP}/artifacts" 2025-12-04T12:46:46.5402480Z RUNNER_ARTIFACT_DIR="${RUNNER_TEMP}/artifacts" 2025-12-04T12:46:46.5402635Z rm -rf "${RUNNER_ARTIFACT_DIR}" 2025-12-04T12:46:46.5402781Z mkdir -p "${RUNNER_ARTIFACT_DIR}" 2025-12-04T12:46:46.5402963Z echo "RUNNER_ARTIFACT_DIR=${RUNNER_ARTIFACT_DIR}" >> "${GITHUB_ENV}" 2025-12-04T12:46:46.5403133Z  2025-12-04T12:46:46.5403262Z RUNNER_TEST_RESULTS_DIR="${RUNNER_TEMP}/test-results" 2025-12-04T12:46:46.5403431Z rm -rf "${RUNNER_TEST_RESULTS_DIR}" 2025-12-04T12:46:46.5403585Z mkdir -p "${RUNNER_TEST_RESULTS_DIR}" 2025-12-04T12:46:46.5403776Z echo "RUNNER_TEST_RESULTS_DIR=${RUNNER_TEST_RESULTS_DIR}" >> "${GITHUB_ENV}" 2025-12-04T12:46:46.5403945Z  2025-12-04T12:46:46.5404263Z RUNNER_DOCS_DIR="${RUNNER_TEMP}/docs" 2025-12-04T12:46:46.5404401Z rm -rf "${RUNNER_DOCS_DIR}" 2025-12-04T12:46:46.5404530Z mkdir -p "${RUNNER_DOCS_DIR}" 2025-12-04T12:46:46.5404695Z echo "RUNNER_DOCS_DIR=${RUNNER_DOCS_DIR}" >> "${GITHUB_ENV}" 2025-12-04T12:46:46.5409787Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.5409936Z env: 2025-12-04T12:46:46.5410031Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.5410136Z ##[endgroup] 2025-12-04T12:46:46.5479367Z ##[group]Run env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:46:46.5479595Z env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:46:46.5479795Z env | grep '^CI' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:46:46.5483317Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.5483463Z env: 2025-12-04T12:46:46.5483568Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.5483708Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:46.5483887Z RUNNER_TEST_RESULTS_DIR: 
/home/runner/_work/_temp/test-results 2025-12-04T12:46:46.5484059Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:46.5484197Z ##[endgroup] 2025-12-04T12:46:46.5542880Z ##[group]Run # All GPUs are visible to the runner; visibility, if needed, will be set by run_test.py. 2025-12-04T12:46:46.5543197Z # All GPUs are visible to the runner; visibility, if needed, will be set by run_test.py. 2025-12-04T12:46:46.5543405Z # Add render group for container creation. 2025-12-04T12:46:46.5543581Z render_gid=`cat /etc/group | grep render | cut -d: -f3` 2025-12-04T12:46:46.5543781Z # Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG. 2025-12-04T12:46:46.5544115Z if [ -f "/etc/podinfo/gha-render-devices" ]; then 2025-12-04T12:46:46.5544303Z  DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices) 2025-12-04T12:46:46.5544442Z else 2025-12-04T12:46:46.5544550Z  DEVICE_FLAG="--device /dev/dri" 2025-12-04T12:46:46.5544674Z fi 2025-12-04T12:46:46.5544858Z # The --group-add daemon and --group-add bin are needed in the Ubuntu 24.04 and Almalinux OSs respectively. 2025-12-04T12:46:46.5545143Z # This is due to the device files (/dev/kfd & /dev/dri) being owned by video group on bare metal. 2025-12-04T12:46:46.5545399Z # This video group ID maps to subgid 1 inside the docker image due to the /etc/subgid entries. 2025-12-04T12:46:46.5545665Z # The group name corresponding to group ID 1 can change depending on the OS, so both are necessary. 
2025-12-04T12:46:46.5546113Z echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd $DEVICE_FLAG --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host" >> "${GITHUB_ENV}" 2025-12-04T12:46:46.5551369Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:46.5551513Z env: 2025-12-04T12:46:46.5551609Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.5551744Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:46.5551920Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:46.5552088Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:46.5552210Z ##[endgroup] 2025-12-04T12:46:46.5629829Z ##[group]Run aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 2025-12-04T12:46:46.5630049Z with: 2025-12-04T12:46:46.5630196Z role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only 2025-12-04T12:46:46.5630365Z aws-region: us-east-1 2025-12-04T12:46:46.5630488Z role-duration-seconds: 18000 2025-12-04T12:46:46.5630604Z audience: sts.amazonaws.com 2025-12-04T12:46:46.5630709Z env: 2025-12-04T12:46:46.5630796Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:46.5631059Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:46.5631228Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:46.5631395Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:46.5631892Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:46.5632366Z ##[endgroup] 2025-12-04T12:46:46.8558970Z Assuming role with OIDC 2025-12-04T12:46:47.1965816Z Authenticated as 
assumedRoleId AROAUPVRELQNLLCOPFEJR:GitHubActions 2025-12-04T12:46:47.2933459Z ##[group]Run aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 2025-12-04T12:46:47.2933689Z with: 2025-12-04T12:46:47.2933800Z mask-password: true 2025-12-04T12:46:47.2933947Z registry-type: private 2025-12-04T12:46:47.2934072Z skip-logout: false 2025-12-04T12:46:47.2934184Z env: 2025-12-04T12:46:47.2934291Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:47.2934449Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:47.2934655Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:47.2934844Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:47.2935422Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:47.2935986Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:47.2936281Z AWS_REGION: us-east-1 2025-12-04T12:46:47.2936751Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:47.2936933Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:47.2939185Z AWS_SESSION_TOKEN: *** 2025-12-04T12:46:47.2939292Z ##[endgroup] 2025-12-04T12:46:47.7118003Z Logging into registry 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:48.3261257Z ##[group]Run env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:46:48.3261533Z env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:46:48.3261747Z env | grep '^CI' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:46:48.3261955Z env | grep '^RUNNER' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:46:48.3266790Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:48.3266941Z env: 2025-12-04T12:46:48.3267043Z 
GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:48.3267182Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:48.3267380Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:48.3267598Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:48.3268121Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:48.3268619Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:48.3268742Z AWS_REGION: us-east-1 2025-12-04T12:46:48.3268978Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:48.3269142Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:48.3271278Z AWS_SESSION_TOKEN: *** 2025-12-04T12:46:48.3271393Z ##[endgroup] 2025-12-04T12:46:48.3366286Z ##[group]Run ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') 2025-12-04T12:46:48.3366514Z ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') 2025-12-04T12:46:48.3366779Z if [[ $ngpu -lt 2 ]]; then #We are temporarily reducing this down to 2 from 4 so that we can run tests on nodes with less gpus. 
2025-12-04T12:46:48.3367067Z  echo "Error: only $ngpu GPU(s) detected, at least 2 GPUs are needed for distributed jobs" 2025-12-04T12:46:48.3367256Z  exit 1 2025-12-04T12:46:48.3367355Z fi 2025-12-04T12:46:48.3372108Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:48.3372256Z env: 2025-12-04T12:46:48.3372360Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:48.3372499Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:48.3372680Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:48.3372852Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:48.3373397Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:48.3373897Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:48.3374021Z AWS_REGION: us-east-1 2025-12-04T12:46:48.3374248Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:48.3374411Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:48.3376509Z AWS_SESSION_TOKEN: *** 2025-12-04T12:46:48.3376623Z ##[endgroup] 2025-12-04T12:46:48.4447201Z ##[group]Run pytorch/test-infra/.github/actions/calculate-docker-image@main 2025-12-04T12:46:48.4447392Z with: 2025-12-04T12:46:48.4447749Z docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:48.4448052Z use-custom-docker-registry: true 2025-12-04T12:46:48.4448179Z docker-build-dir: .ci/docker 2025-12-04T12:46:48.4448300Z docker-build-script: ./build.sh 2025-12-04T12:46:48.4448547Z working-directory: . 
2025-12-04T12:46:48.4448690Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:48.4448845Z force-push: false 2025-12-04T12:46:48.4448940Z env: 2025-12-04T12:46:48.4449028Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:48.4449165Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:48.4449339Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:48.4449533Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:48.4450037Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:48.4450524Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:48.4450638Z AWS_REGION: us-east-1 2025-12-04T12:46:48.4450837Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:48.4450994Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:48.4453086Z AWS_SESSION_TOKEN: *** 2025-12-04T12:46:48.4453193Z ##[endgroup] 2025-12-04T12:46:48.4461928Z ##[group]Run set -ex 2025-12-04T12:46:48.4462060Z set -ex 2025-12-04T12:46:48.4462158Z  2025-12-04T12:46:48.4462313Z # If the docker build directory or the build script doesn't exist, the action will 2025-12-04T12:46:48.4462563Z # gracefully return the docker image name as it is. 
Pulling docker image in Linux 2025-12-04T12:46:48.4462774Z # job could then download the pre-built image as usual 2025-12-04T12:46:48.4463028Z if [[ -d "${DOCKER_BUILD_DIR}" ]] && [[ -f "${DOCKER_BUILD_DIR}/${DOCKER_BUILD_SCRIPT}" ]] && [[ "${USE_CUSTOM_DOCKER_REGISTRY}" == "true" ]]; then 2025-12-04T12:46:48.4463264Z  echo "skip=false" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4463398Z else 2025-12-04T12:46:48.4463509Z  echo "skip=true" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4463684Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4463836Z  2025-12-04T12:46:48.4464041Z  echo "Not using custom ECR registry. Either it was not requested or there is no Docker build script in the ${REPO_NAME} repo..." 2025-12-04T12:46:48.4464267Z  exit 0 2025-12-04T12:46:48.4464360Z fi 2025-12-04T12:46:48.4464448Z  2025-12-04T12:46:48.4464586Z if [[ "${DOCKER_IMAGE_NAME}" == *"${DOCKER_REGISTRY}/${REPO_NAME}"* ]]; then 2025-12-04T12:46:48.4464810Z  # The docker image name already includes the ECR prefix and tag, so we can just 2025-12-04T12:46:48.4465009Z  # use it as it is, but first let's extract the tag 2025-12-04T12:46:48.4465194Z  DOCKER_TAG=$(echo "${DOCKER_IMAGE_NAME}" | awk -F '[:,]' '{print $2}') 2025-12-04T12:46:48.4465389Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4465575Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4465728Z else 2025-12-04T12:46:48.4465839Z  if [[ "${DOCKER_IMAGE_NAME}" == *:* ]]; then 2025-12-04T12:46:48.4465992Z  CUSTOM_TAG_PREFIX=${DOCKER_IMAGE_NAME#*:} 2025-12-04T12:46:48.4466143Z  DOCKER_IMAGE_NAME=${DOCKER_IMAGE_NAME%%:*} 2025-12-04T12:46:48.4466271Z  fi 2025-12-04T12:46:48.4466522Z  DOCKER_TAG=${CUSTOM_TAG_PREFIX:+${CUSTOM_TAG_PREFIX}-}$(git rev-parse HEAD:"${DOCKER_BUILD_DIR}") 2025-12-04T12:46:48.4466751Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4466986Z  echo 
"docker-image=${DOCKER_REGISTRY}/${REPO_NAME}/${DOCKER_IMAGE_NAME}:${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4467241Z  echo "custom-tag-prefix=${CUSTOM_TAG_PREFIX}" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4467400Z fi 2025-12-04T12:46:48.4472188Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:48.4472333Z env: 2025-12-04T12:46:48.4472427Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:48.4472564Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:48.4472744Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:48.4472915Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:48.4473421Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:48.4473906Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:48.4474023Z AWS_REGION: us-east-1 2025-12-04T12:46:48.4474160Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:48.4474316Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:48.4476431Z AWS_SESSION_TOKEN: *** 2025-12-04T12:46:48.4476547Z REPO_NAME: pytorch 2025-12-04T12:46:48.4476824Z DOCKER_IMAGE_NAME: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:48.4477114Z DOCKER_BUILD_DIR: .ci/docker 2025-12-04T12:46:48.4477233Z DOCKER_BUILD_SCRIPT: ./build.sh 2025-12-04T12:46:48.4477385Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:48.4477576Z USE_CUSTOM_DOCKER_REGISTRY: true 2025-12-04T12:46:48.4477703Z CUSTOM_TAG_PREFIX: 2025-12-04T12:46:48.4477810Z ##[endgroup] 2025-12-04T12:46:48.4509516Z + [[ -d .ci/docker ]] 2025-12-04T12:46:48.4509655Z + [[ -f .ci/docker/./build.sh ]] 
2025-12-04T12:46:48.4510063Z + [[ true == \t\r\u\e ]] 2025-12-04T12:46:48.4510265Z + echo skip=false 2025-12-04T12:46:48.4510825Z + [[ 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a == *\3\0\8\5\3\5\3\8\5\1\1\4\.\d\k\r\.\e\c\r\.\u\s\-\e\a\s\t\-\1\.\a\m\a\z\o\n\a\w\s\.\c\o\m\/\p\y\t\o\r\c\h* ]] 2025-12-04T12:46:48.4517545Z ++ echo 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:48.4518262Z ++ awk -F '[:,]' '{print $2}' 2025-12-04T12:46:48.4528645Z + DOCKER_TAG=pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:48.4529029Z + echo docker-tag=pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:48.4529686Z + echo docker-image=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:48.4551792Z ##[group]Run set +e 2025-12-04T12:46:48.4551974Z set +e 2025-12-04T12:46:48.4552093Z set -x 2025-12-04T12:46:48.4552222Z  2025-12-04T12:46:48.4552338Z login() { 2025-12-04T12:46:48.4552591Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2025-12-04T12:46:48.4552861Z } 2025-12-04T12:46:48.4552974Z  2025-12-04T12:46:48.4553090Z retry () { 2025-12-04T12:46:48.4553231Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 2025-12-04T12:46:48.4553402Z } 2025-12-04T12:46:48.4553511Z  2025-12-04T12:46:48.4553636Z retry login "${DOCKER_REGISTRY}" 2025-12-04T12:46:48.4553784Z  2025-12-04T12:46:48.4554075Z START_TIME=$(date +%s) 2025-12-04T12:46:48.4554233Z # Wait up to 120 minutes 2025-12-04T12:46:48.4554425Z while [[ $(( $(date +%s) - 7200 )) -lt $START_TIME ]]; do 2025-12-04T12:46:48.4554655Z  # Check if image already exists, if it does then skip building it 2025-12-04T12:46:48.4554900Z  if docker 
manifest inspect "${DOCKER_IMAGE}"; then 2025-12-04T12:46:48.4555081Z  exit 0 2025-12-04T12:46:48.4555204Z  fi 2025-12-04T12:46:48.4555475Z  2025-12-04T12:46:48.4555671Z  # NB: This flag is used by Docker build workflow to push the image to ECR, so we can 2025-12-04T12:46:48.4555980Z  # use this to differentiate between the Docker build and regular build jobs. For the 2025-12-04T12:46:48.4556288Z  # latter, it will wait for the Docker images to become available before continuing 2025-12-04T12:46:48.4556548Z  if [ "${DOCKER_PUSH:-false}" == "true" ]; then 2025-12-04T12:46:48.4556764Z  # It's a Docker build job, let's build the image 2025-12-04T12:46:48.4556938Z  break 2025-12-04T12:46:48.4557109Z  else 2025-12-04T12:46:48.4557281Z  # It's a regular build job, wait for the image to become available 2025-12-04T12:46:48.4557450Z  sleep 300 2025-12-04T12:46:48.4557599Z  fi 2025-12-04T12:46:48.4557690Z done 2025-12-04T12:46:48.4557782Z  2025-12-04T12:46:48.4557928Z # NB: This part requires a full checkout. Otherwise, the merge base will 2025-12-04T12:46:48.4558147Z # be empty. The default action would be to continue rebuild the image 2025-12-04T12:46:48.4558347Z if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then 2025-12-04T12:46:48.4558529Z  # if we're on the base branch then use the parent commit 2025-12-04T12:46:48.4558699Z  MERGE_BASE=$(git rev-parse HEAD~) 2025-12-04T12:46:48.4558833Z else 2025-12-04T12:46:48.4558968Z  # otherwise we're on a PR, so use the most recent base commit 2025-12-04T12:46:48.4559159Z  MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") 2025-12-04T12:46:48.4559305Z fi 2025-12-04T12:46:48.4559398Z  2025-12-04T12:46:48.4559497Z if [[ -z "${MERGE_BASE}" ]]; then 2025-12-04T12:46:48.4559641Z  echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4559775Z  2025-12-04T12:46:48.4559961Z  echo "Finding merge base only works with full checkout, please set fetch-depth to 0, continuing ..." 
2025-12-04T12:46:48.4560173Z  exit 0 2025-12-04T12:46:48.4560269Z fi 2025-12-04T12:46:48.4560362Z  2025-12-04T12:46:48.4560487Z if ! git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}"; then 2025-12-04T12:46:48.4560744Z  echo "Directory '${DOCKER_BUILD_DIR}' not found in commit $MERGE_BASE, you should rebase onto a more recent commit" 2025-12-04T12:46:48.4560970Z  exit 1 2025-12-04T12:46:48.4561066Z fi 2025-12-04T12:46:48.4561155Z  2025-12-04T12:46:48.4561304Z PREVIOUS_DOCKER_TAG=$(git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}") 2025-12-04T12:46:48.4561554Z # If no image exists but the hash is the same as the previous hash then we should error out here 2025-12-04T12:46:48.4561781Z if [[ "${PREVIOUS_DOCKER_TAG}" == "${DOCKER_TAG}" ]]; then 2025-12-04T12:46:48.4562035Z  echo "WARNING: Something has gone wrong and the previous image isn't available for the merge-base of your branch" 2025-12-04T12:46:48.4562325Z  echo " Will re-build docker image to store in local cache, TTS may be longer" 2025-12-04T12:46:48.4562504Z fi 2025-12-04T12:46:48.4562603Z  2025-12-04T12:46:48.4562718Z echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2025-12-04T12:46:48.4565825Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:48.4566022Z env: 2025-12-04T12:46:48.4566118Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:48.4566264Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:48.4566450Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:48.4566624Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:48.4567134Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:48.4567699Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:48.4567820Z 
AWS_REGION: us-east-1 2025-12-04T12:46:48.4567982Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:48.4568147Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:48.4570259Z AWS_SESSION_TOKEN: *** 2025-12-04T12:46:48.4570402Z DOCKER_BUILD_DIR: .ci/docker 2025-12-04T12:46:48.4570554Z BASE_REVISION: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:46:48.4570877Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:48.4571244Z DOCKER_TAG: pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:48.4571482Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:48.4571641Z DOCKER_PUSH: 2025-12-04T12:46:48.4571756Z ##[endgroup] 2025-12-04T12:46:48.4589804Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:48.4590000Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:48.4592623Z + aws ecr get-login-password --region us-east-1 2025-12-04T12:46:48.4592998Z /home/runner/_work/_temp/83313167-e8b3-413e-8cc4-1212c5d6c6a6.sh: line 5: aws: command not found 2025-12-04T12:46:48.4593461Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:48.4678463Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T12:46:48.4686800Z + sleep 1 2025-12-04T12:46:49.4695089Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:49.4698279Z + aws ecr get-login-password --region us-east-1 2025-12-04T12:46:49.4698825Z /home/runner/_work/_temp/83313167-e8b3-413e-8cc4-1212c5d6c6a6.sh: line 5: aws: command not found 2025-12-04T12:46:49.4700577Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:49.4792930Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T12:46:49.4803711Z + sleep 2 2025-12-04T12:46:51.4815622Z + login 
308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:51.4819170Z + aws ecr get-login-password --region us-east-1 2025-12-04T12:46:51.4819983Z /home/runner/_work/_temp/83313167-e8b3-413e-8cc4-1212c5d6c6a6.sh: line 5: aws: command not found 2025-12-04T12:46:51.4821038Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:51.4914777Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T12:46:51.4924015Z ++ date +%s 2025-12-04T12:46:51.4931190Z + START_TIME=1764852411 2025-12-04T12:46:51.4933694Z ++ date +%s 2025-12-04T12:46:51.4940701Z + [[ 1764845211 -lt 1764852411 ]] 2025-12-04T12:46:51.4941326Z + docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:52.8263783Z { 2025-12-04T12:46:52.8264035Z "schemaVersion": 2, 2025-12-04T12:46:52.8264432Z "mediaType": "application/vnd.docker.distribution.manifest.v2+json", 2025-12-04T12:46:52.8264827Z "config": { 2025-12-04T12:46:52.8265134Z "mediaType": "application/vnd.docker.container.image.v1+json", 2025-12-04T12:46:52.8265482Z "size": 30520, 2025-12-04T12:46:52.8265978Z "digest": "sha256:45252333063339f104d56e41f20304e9511ab21c7768e8d156b95ddf24a9dbe5" 2025-12-04T12:46:52.8266829Z }, 2025-12-04T12:46:52.8267014Z "layers": [ 2025-12-04T12:46:52.8267202Z { 2025-12-04T12:46:52.8267574Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8267925Z "size": 30447951, 2025-12-04T12:46:52.8268299Z "digest": "sha256:63e5bc7682b85ae57a1221210f64d62e7a90b0a30f19af4ca734b8242ae49d63" 2025-12-04T12:46:52.8268680Z }, 2025-12-04T12:46:52.8268849Z { 2025-12-04T12:46:52.8269136Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8269636Z "size": 1554, 2025-12-04T12:46:52.8269982Z "digest": "sha256:835841cca3b7e1464290cdb78e48773e03583413fbed852c3cc5165a392ea44d" 
2025-12-04T12:46:52.8270371Z }, 2025-12-04T12:46:52.8270540Z { 2025-12-04T12:46:52.8270821Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8271163Z "size": 313275691, 2025-12-04T12:46:52.8271537Z "digest": "sha256:aac69780afc8611a5f94a235792d39ae055249c8319ef43b78675998a9b2f825" 2025-12-04T12:46:52.8271910Z }, 2025-12-04T12:46:52.8272076Z { 2025-12-04T12:46:52.8272356Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8272689Z "size": 704, 2025-12-04T12:46:52.8273039Z "digest": "sha256:029495b23122c840ca0e52d487afa8d2c4dbf1991cd7f204ec3e434dcf947bf4" 2025-12-04T12:46:52.8273418Z }, 2025-12-04T12:46:52.8273586Z { 2025-12-04T12:46:52.8273860Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8274203Z "size": 1218, 2025-12-04T12:46:52.8274554Z "digest": "sha256:d0fb85b008332051a3f7c052721ef68bde404b46c23fa43ad040373bd367826c" 2025-12-04T12:46:52.8274932Z }, 2025-12-04T12:46:52.8275096Z { 2025-12-04T12:46:52.8275376Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8275709Z "size": 484, 2025-12-04T12:46:52.8276058Z "digest": "sha256:59b63930883363c7d2aaab27cc61555d9f3e119dc18247a8624c98ebdaa354a5" 2025-12-04T12:46:52.8276383Z }, 2025-12-04T12:46:52.8276517Z { 2025-12-04T12:46:52.8276716Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8276958Z "size": 110363202, 2025-12-04T12:46:52.8277220Z "digest": "sha256:dc112c89d57aa1e85082e40a56e5bc743d64f834ae2f98afe91f60c248354d38" 2025-12-04T12:46:52.8277544Z }, 2025-12-04T12:46:52.8277664Z { 2025-12-04T12:46:52.8277863Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8278104Z "size": 4436, 2025-12-04T12:46:52.8278354Z "digest": "sha256:522eab2402e5001810155ef7eb56940b7c01a4fef62ac588886981c3b8ee8e1e" 2025-12-04T12:46:52.8278622Z }, 2025-12-04T12:46:52.8278741Z { 2025-12-04T12:46:52.8278939Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8279180Z "size": 1755, 2025-12-04T12:46:52.8279428Z "digest": "sha256:2b5a11b41761d8ea3b829e4772e4064cb6c4e4989126af324d0057661e4493a1" 2025-12-04T12:46:52.8279697Z }, 2025-12-04T12:46:52.8279817Z { 2025-12-04T12:46:52.8280018Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8280259Z "size": 724, 2025-12-04T12:46:52.8280501Z "digest": "sha256:9681563a88ff9e62494a2740e537440d3df978d466c9478d6a941fae8b57b084" 2025-12-04T12:46:52.8280765Z }, 2025-12-04T12:46:52.8280884Z { 2025-12-04T12:46:52.8281082Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8281325Z "size": 3185588166, 2025-12-04T12:46:52.8281584Z "digest": "sha256:73e33534e9eb94cf29418d65944168962b65fe21f55e9b8bad18c76e9b3a37b8" 2025-12-04T12:46:52.8281858Z }, 2025-12-04T12:46:52.8281974Z { 2025-12-04T12:46:52.8282171Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8282413Z "size": 396, 2025-12-04T12:46:52.8282670Z "digest": "sha256:5bfdaeb5578d6ffcd7db29c48303cbceb13c591210feaa216a8daa7a6d445b4b" 2025-12-04T12:46:52.8282949Z }, 2025-12-04T12:46:52.8283066Z { 2025-12-04T12:46:52.8283334Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8283579Z "size": 236863, 2025-12-04T12:46:52.8283837Z "digest": "sha256:c07d27e4d3a5ba4ad5325bb785b2e4f058fe5e10ec1aeeb413a1e152b073f203" 2025-12-04T12:46:52.8284119Z }, 2025-12-04T12:46:52.8284239Z { 2025-12-04T12:46:52.8284703Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8284994Z "size": 787, 2025-12-04T12:46:52.8285254Z "digest": "sha256:b21856d1bf420da6fa8ec7331b82ab355d4f4178644e7d3a3d3d0fbc3610109a" 2025-12-04T12:46:52.8285590Z }, 2025-12-04T12:46:52.8285711Z { 2025-12-04T12:46:52.8285910Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8286153Z "size": 106, 
2025-12-04T12:46:52.8286350Z "digest": "sha256:cb19d84867e4063f55db9459c28c50a2abc37c06d3c1ca82ba95fa8427cc438a" 2025-12-04T12:46:52.8286556Z }, 2025-12-04T12:46:52.8286647Z { 2025-12-04T12:46:52.8286803Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8286987Z "size": 1496, 2025-12-04T12:46:52.8287177Z "digest": "sha256:8165374f8dccf88a7791a5d31afbe29e4d4542b4f1cf1904945e07f9af6bf8ba" 2025-12-04T12:46:52.8287385Z }, 2025-12-04T12:46:52.8287519Z { 2025-12-04T12:46:52.8287669Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8287883Z "size": 458789560, 2025-12-04T12:46:52.8288086Z "digest": "sha256:1aecc77354ceba59ec6f0d37a558f2dbb6d5c0854553ee8505ac8707b422da6d" 2025-12-04T12:46:52.8288303Z }, 2025-12-04T12:46:52.8288397Z { 2025-12-04T12:46:52.8288547Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8288732Z "size": 164, 2025-12-04T12:46:52.8288923Z "digest": "sha256:465d3fd643aa2ea0ad07335cda66f12f1d7e5e800c4e9385ec466bc8a1ceabda" 2025-12-04T12:46:52.8289134Z }, 2025-12-04T12:46:52.8289227Z { 2025-12-04T12:46:52.8289376Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8289565Z "size": 104, 2025-12-04T12:46:52.8289752Z "digest": "sha256:6c503e779d6f41ca7f51309875df2b725c171926aece7009c4b8a64d1ba3f58e" 2025-12-04T12:46:52.8289959Z }, 2025-12-04T12:46:52.8290052Z { 2025-12-04T12:46:52.8290204Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8290388Z "size": 724, 2025-12-04T12:46:52.8290572Z "digest": "sha256:9681563a88ff9e62494a2740e537440d3df978d466c9478d6a941fae8b57b084" 2025-12-04T12:46:52.8290774Z }, 2025-12-04T12:46:52.8290866Z { 2025-12-04T12:46:52.8291022Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8291205Z "size": 196, 2025-12-04T12:46:52.8291395Z "digest": "sha256:f7e9a021f0ee3d11a50dcb96378af8103a21f6c3c142f54529207648f3ed00b2" 
2025-12-04T12:46:52.8291602Z }, 2025-12-04T12:46:52.8291696Z { 2025-12-04T12:46:52.8291848Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8292033Z "size": 2583, 2025-12-04T12:46:52.8292224Z "digest": "sha256:8e023b349080fb11ee55491bc9b842b30e9e3a90246d05b303a73dc62038caf2" 2025-12-04T12:46:52.8292430Z }, 2025-12-04T12:46:52.8292521Z { 2025-12-04T12:46:52.8292671Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8292858Z "size": 7577171420, 2025-12-04T12:46:52.8293053Z "digest": "sha256:8188df80e595a3dbcf84623c6a58a655269898cbb60029435f136d7f9d34ccaa" 2025-12-04T12:46:52.8293258Z }, 2025-12-04T12:46:52.8293352Z { 2025-12-04T12:46:52.8293503Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8293689Z "size": 135, 2025-12-04T12:46:52.8293882Z "digest": "sha256:3c2c2f8c74bfa16c4bf9a832c97bbb1d55205b2b4a2cead02cf74301ca1001fb" 2025-12-04T12:46:52.8294094Z }, 2025-12-04T12:46:52.8294186Z { 2025-12-04T12:46:52.8294338Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8294522Z "size": 104, 2025-12-04T12:46:52.8294777Z "digest": "sha256:2aa7784fbe3300f8bbfb6bb51cff3b01fd091e829c2bc7ab9e25261a0dd9b3bd" 2025-12-04T12:46:52.8294990Z }, 2025-12-04T12:46:52.8295082Z { 2025-12-04T12:46:52.8295235Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8295423Z "size": 612, 2025-12-04T12:46:52.8295613Z "digest": "sha256:2b3b5215d3ebe8789f0444457bfd5a6e218289b64aa07653ac3d03ddda5e6708" 2025-12-04T12:46:52.8295821Z }, 2025-12-04T12:46:52.8295913Z { 2025-12-04T12:46:52.8296064Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8296276Z "size": 838191945, 2025-12-04T12:46:52.8296444Z "digest": "sha256:99b1f1ea3e857834cebd01763d90fbd700aeb9c2d2ef23eda2cfff5652c9708b" 2025-12-04T12:46:52.8296620Z }, 2025-12-04T12:46:52.8296698Z { 2025-12-04T12:46:52.8296822Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8296976Z "size": 111, 2025-12-04T12:46:52.8297141Z "digest": "sha256:18d6daba0a5768a37ad106b57974f6b7efd35c43a87c246bcd3f43fea88f2d2b" 2025-12-04T12:46:52.8297314Z }, 2025-12-04T12:46:52.8297390Z { 2025-12-04T12:46:52.8297571Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8297724Z "size": 1555, 2025-12-04T12:46:52.8297884Z "digest": "sha256:5277f2a503ebd17ba9d9b86cc9bac86265504adeb449c0647616ddaacd3cbc41" 2025-12-04T12:46:52.8298060Z }, 2025-12-04T12:46:52.8298134Z { 2025-12-04T12:46:52.8298264Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8298422Z "size": 107, 2025-12-04T12:46:52.8298578Z "digest": "sha256:3198a9717aace920fd5de085319adf75091af05fc4318ce4b16a8a5b0e8d449e" 2025-12-04T12:46:52.8298750Z }, 2025-12-04T12:46:52.8298827Z { 2025-12-04T12:46:52.8298951Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8299103Z "size": 166, 2025-12-04T12:46:52.8299255Z "digest": "sha256:99a4918e5808277879449e97ccd7190db6b9aa2d742b57a3b831ce0198522bdd" 2025-12-04T12:46:52.8299424Z }, 2025-12-04T12:46:52.8299503Z { 2025-12-04T12:46:52.8299628Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8299782Z "size": 3526081, 2025-12-04T12:46:52.8299941Z "digest": "sha256:15bb11dfc6acc3537d527d6771c8e711e5605e99f82ec41e805d4600b8a97516" 2025-12-04T12:46:52.8300112Z }, 2025-12-04T12:46:52.8300189Z { 2025-12-04T12:46:52.8300314Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8300467Z "size": 107, 2025-12-04T12:46:52.8300627Z "digest": "sha256:bd87c8766e90e33db17514558ac591cc3f4149afd7abeaef4dd5770bbfa14210" 2025-12-04T12:46:52.8300800Z }, 2025-12-04T12:46:52.8300875Z { 2025-12-04T12:46:52.8301001Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8301157Z "size": 829, 
2025-12-04T12:46:52.8301311Z "digest": "sha256:1969e15d0c13874ea5883ed829235a19ef6dc21c8aa6172032b78a8ffa6ff262" 2025-12-04T12:46:52.8301481Z }, 2025-12-04T12:46:52.8301557Z { 2025-12-04T12:46:52.8301689Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8301845Z "size": 26973054, 2025-12-04T12:46:52.8302009Z "digest": "sha256:24a03847d382b73c11969f8f73916a6bedf5ccea12f6f4290b3880f29ceda32a" 2025-12-04T12:46:52.8302179Z }, 2025-12-04T12:46:52.8302258Z { 2025-12-04T12:46:52.8302383Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8302560Z + exit 0 2025-12-04T12:46:52.8302644Z "size": 104, 2025-12-04T12:46:52.8302803Z "digest": "sha256:816e2e34e01839a35d624dbf4bd9ac9bea4c975104af47a0e6b6b6dee6c6f98d" 2025-12-04T12:46:52.8302978Z }, 2025-12-04T12:46:52.8303054Z { 2025-12-04T12:46:52.8303179Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8303333Z "size": 424, 2025-12-04T12:46:52.8303488Z "digest": "sha256:b168858b85373f8ddca549d79267a06de4fa945d04bf791c55c9ddc93957fa3c" 2025-12-04T12:46:52.8303661Z }, 2025-12-04T12:46:52.8303738Z { 2025-12-04T12:46:52.8303898Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8304053Z "size": 19309386, 2025-12-04T12:46:52.8304220Z "digest": "sha256:6b8d5ff02e267e38322afbb8a58ed63ce9d75b10e9e73255e6affcbc6b6539bf" 2025-12-04T12:46:52.8304394Z }, 2025-12-04T12:46:52.8304472Z { 2025-12-04T12:46:52.8304601Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8304757Z "size": 826, 2025-12-04T12:46:52.8304914Z "digest": "sha256:4e3b10a5dd6aed29f238d604925e2a4f873141c1087c8dd4fdde5c61e7560893" 2025-12-04T12:46:52.8305152Z }, 2025-12-04T12:46:52.8305228Z { 2025-12-04T12:46:52.8305355Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8305509Z "size": 724, 2025-12-04T12:46:52.8305664Z "digest": 
"sha256:9681563a88ff9e62494a2740e537440d3df978d466c9478d6a941fae8b57b084" 2025-12-04T12:46:52.8305833Z }, 2025-12-04T12:46:52.8305910Z { 2025-12-04T12:46:52.8306042Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8306195Z "size": 149, 2025-12-04T12:46:52.8306351Z "digest": "sha256:3092fab73b59190b9facfc49bf18f58612172bc2fd68dfa339a1118632616939" 2025-12-04T12:46:52.8306522Z }, 2025-12-04T12:46:52.8306599Z { 2025-12-04T12:46:52.8306725Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8306877Z "size": 136, 2025-12-04T12:46:52.8307037Z "digest": "sha256:20020dd28a15ba092fcbfe906ee39cdddfcc9d0b7eb42fdd6f4c08a984fa9c00" 2025-12-04T12:46:52.8307216Z }, 2025-12-04T12:46:52.8307293Z { 2025-12-04T12:46:52.8307418Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8307610Z "size": 140, 2025-12-04T12:46:52.8307767Z "digest": "sha256:ae5280ce969dcff08c091e9a5f7641f13561b2b0ee44d78b7c3f81d8fe8e6d32" 2025-12-04T12:46:52.8307941Z }, 2025-12-04T12:46:52.8308020Z { 2025-12-04T12:46:52.8308144Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8308299Z "size": 32, 2025-12-04T12:46:52.8308460Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T12:46:52.8308633Z }, 2025-12-04T12:46:52.8308708Z { 2025-12-04T12:46:52.8308853Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8309004Z "size": 222, 2025-12-04T12:46:52.8309164Z "digest": "sha256:fe17d9eb0fd26d3af4c724bf570d833978b131cedb7dc17a800aa388a246b3cd" 2025-12-04T12:46:52.8309338Z }, 2025-12-04T12:46:52.8309421Z { 2025-12-04T12:46:52.8309547Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8309702Z "size": 346, 2025-12-04T12:46:52.8309856Z "digest": "sha256:a51e0dab2d596e6563483f27c12660007160847d177ba4c31812a8f44ada5754" 2025-12-04T12:46:52.8310028Z }, 
2025-12-04T12:46:52.8310104Z { 2025-12-04T12:46:52.8310230Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8310382Z "size": 88300, 2025-12-04T12:46:52.8310547Z "digest": "sha256:6eb176cefd72d37ecbcdf074289a8f1de732d8816cc695ece7e4709d098094d6" 2025-12-04T12:46:52.8310722Z }, 2025-12-04T12:46:52.8310799Z { 2025-12-04T12:46:52.8310925Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8311077Z "size": 106, 2025-12-04T12:46:52.8311234Z "digest": "sha256:e7b8cf2e8d5a4c56db9726ce62c1176032408b3b1c25a000592361cb4245e2b5" 2025-12-04T12:46:52.8311406Z }, 2025-12-04T12:46:52.8311484Z { 2025-12-04T12:46:52.8311612Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8311767Z "size": 1671, 2025-12-04T12:46:52.8311928Z "digest": "sha256:ef3a5060abce88884bc8bd815aa41c46427f34eeb132fe0ddd85a3f86e6dc83d" 2025-12-04T12:46:52.8312104Z }, 2025-12-04T12:46:52.8312181Z { 2025-12-04T12:46:52.8312306Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8312457Z "size": 724, 2025-12-04T12:46:52.8312650Z "digest": "sha256:9681563a88ff9e62494a2740e537440d3df978d466c9478d6a941fae8b57b084" 2025-12-04T12:46:52.8312818Z }, 2025-12-04T12:46:52.8312895Z { 2025-12-04T12:46:52.8313022Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8313176Z "size": 138, 2025-12-04T12:46:52.8313336Z "digest": "sha256:a6f4ec14b42b8f0a83d20aa6a985ddb6a1bf64e0ed3d44afd3484b87d4ed5ad3" 2025-12-04T12:46:52.8313512Z }, 2025-12-04T12:46:52.8313589Z { 2025-12-04T12:46:52.8313715Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8313901Z "size": 119, 2025-12-04T12:46:52.8314060Z "digest": "sha256:7e5a0c956cfbd6f8074fbfd3b1d416e6635d632835ec00c8dd4c015a21da19b4" 2025-12-04T12:46:52.8314235Z }, 2025-12-04T12:46:52.8314312Z { 2025-12-04T12:46:52.8314437Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8314595Z "size": 6238423049, 2025-12-04T12:46:52.8314770Z "digest": "sha256:b4f78730cfe76ce091b78b2e2e3d52be03f1097b3e4c3de5bd79f8d13a853132" 2025-12-04T12:46:52.8314945Z }, 2025-12-04T12:46:52.8315022Z { 2025-12-04T12:46:52.8315147Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8315301Z "size": 174, 2025-12-04T12:46:52.8315453Z "digest": "sha256:081028f24389b112683689fd362e8c0d6f358082710e72feab91cea6383feb4d" 2025-12-04T12:46:52.8315619Z }, 2025-12-04T12:46:52.8315696Z { 2025-12-04T12:46:52.8315822Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8315978Z "size": 1896, 2025-12-04T12:46:52.8316144Z "digest": "sha256:a534dcf4b9a9e5fabed742c8a8fc43c9cfe7346ea88ab3c177c3b14fd3afe00a" 2025-12-04T12:46:52.8316323Z }, 2025-12-04T12:46:52.8316398Z { 2025-12-04T12:46:52.8316523Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8316678Z "size": 197577597, 2025-12-04T12:46:52.8316839Z "digest": "sha256:2e77500302cc13224427e1d74e471bd79d5109ba6a5099a83df1d10b786f71ba" 2025-12-04T12:46:52.8317014Z }, 2025-12-04T12:46:52.8317092Z { 2025-12-04T12:46:52.8317218Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8317371Z "size": 304, 2025-12-04T12:46:52.8317575Z "digest": "sha256:bc08246bb4ba18c3ec5bc69e16b6b4e929c5bd0f3fae10eeb0b1a622a63d6fa2" 2025-12-04T12:46:52.8317754Z }, 2025-12-04T12:46:52.8317831Z { 2025-12-04T12:46:52.8317953Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8318106Z "size": 32, 2025-12-04T12:46:52.8318267Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T12:46:52.8318441Z }, 2025-12-04T12:46:52.8318518Z { 2025-12-04T12:46:52.8318645Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8318798Z "size": 106, 
2025-12-04T12:46:52.8318956Z "digest": "sha256:ff0c473ca120ebdcaa2ba10b3274e82032edd5196019e76d4e7584553704ae81" 2025-12-04T12:46:52.8319129Z }, 2025-12-04T12:46:52.8319209Z { 2025-12-04T12:46:52.8319334Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T12:46:52.8319488Z "size": 54145662, 2025-12-04T12:46:52.8319656Z "digest": "sha256:6bbc14b250efb3cdaad12c91573c6bb9129ad3e3432f0ed1a7eaebc9958d162f" 2025-12-04T12:46:52.8319830Z } 2025-12-04T12:46:52.8319908Z ] 2025-12-04T12:46:52.8319987Z } 2025-12-04T12:46:52.8335289Z ##[group]Run set -eux 2025-12-04T12:46:52.8335411Z set -eux 2025-12-04T12:46:52.8335577Z # It's ok if this steps fails, it would then be an anonymous user like what we used to have 2025-12-04T12:46:52.8335997Z aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token | jq --raw-output '.SecretString' | jq -r .docker_hub_readonly_token | docker login --username pytorchbot --password-stdin || true 2025-12-04T12:46:52.8340831Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:52.8340985Z env: 2025-12-04T12:46:52.8341083Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:52.8341277Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:52.8341462Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:52.8341632Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:52.8342147Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:52.8342678Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:52.8342801Z AWS_REGION: us-east-1 2025-12-04T12:46:52.8342992Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:52.8343154Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:52.8345273Z 
AWS_SESSION_TOKEN: *** 2025-12-04T12:46:52.8345388Z ##[endgroup] 2025-12-04T12:46:52.8371130Z + aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token 2025-12-04T12:46:52.8371666Z /home/runner/_work/_temp/bcc497dd-14b2-4585-8527-ed18d7164423.sh: line 3: aws: command not found 2025-12-04T12:46:52.8372084Z + jq --raw-output .SecretString 2025-12-04T12:46:52.8372330Z + jq -r .docker_hub_readonly_token 2025-12-04T12:46:52.8373807Z + docker login --username pytorchbot --password-stdin 2025-12-04T12:46:52.8466426Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T12:46:52.8473427Z + true 2025-12-04T12:46:52.8544534Z ##[group]Run pytorch/test-infra/.github/actions/pull-docker-image@main 2025-12-04T12:46:52.8544731Z with: 2025-12-04T12:46:52.8544997Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:52.8545321Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:52.8545471Z env: 2025-12-04T12:46:52.8545564Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:52.8545702Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:52.8545880Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:52.8546048Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:52.8546578Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:52.8547069Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:52.8547190Z AWS_REGION: us-east-1 2025-12-04T12:46:52.8547368Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:52.8547586Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:52.8549699Z AWS_SESSION_TOKEN: *** 
2025-12-04T12:46:52.8549808Z ##[endgroup] 2025-12-04T12:46:52.8556407Z ##[group]Run set -x 2025-12-04T12:46:52.8556528Z set -x 2025-12-04T12:46:52.8556624Z set +e 2025-12-04T12:46:52.8556718Z  2025-12-04T12:46:52.8556811Z login() { 2025-12-04T12:46:52.8556999Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2025-12-04T12:46:52.8557198Z } 2025-12-04T12:46:52.8557289Z  2025-12-04T12:46:52.8557377Z retry () { 2025-12-04T12:46:52.8557545Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 2025-12-04T12:46:52.8557675Z } 2025-12-04T12:46:52.8557763Z  2025-12-04T12:46:52.8557862Z retry login "${DOCKER_REGISTRY}" 2025-12-04T12:46:52.8557980Z  2025-12-04T12:46:52.8558169Z IMAGE_SIZE=$(docker manifest inspect "${DOCKER_IMAGE}" | jq '[.layers[].size, .config.size] | add / 1024 / 1024') 2025-12-04T12:46:52.8558421Z echo "Compressed size of image in MB: ${IMAGE_SIZE}" 2025-12-04T12:46:52.8558569Z  2025-12-04T12:46:52.8558662Z set -e 2025-12-04T12:46:52.8558803Z # ignore output since only exit code is used for conditional 2025-12-04T12:46:52.8558992Z # only pull docker image if it's not available locally 2025-12-04T12:46:52.8559197Z if ! 
docker inspect --type=image "${DOCKER_IMAGE}" >/dev/null 2>/dev/null; then 2025-12-04T12:46:52.8559385Z  retry docker pull "${DOCKER_IMAGE}" 2025-12-04T12:46:52.8559508Z fi 2025-12-04T12:46:52.8563897Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:52.8564043Z env: 2025-12-04T12:46:52.8564238Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:52.8564373Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:52.8564552Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:52.8564720Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:52.8565223Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:52.8565713Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:52.8565832Z AWS_REGION: us-east-1 2025-12-04T12:46:52.8565973Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:52.8566129Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:52.8568260Z AWS_SESSION_TOKEN: *** 2025-12-04T12:46:52.8568772Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:52.8569099Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:52.8569253Z ##[endgroup] 2025-12-04T12:46:52.8587204Z + set +e 2025-12-04T12:46:52.8587593Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:52.8587917Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:52.8590022Z + aws ecr get-login-password --region us-east-1 2025-12-04T12:46:52.8590397Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:52.8590855Z 
/home/runner/_work/_temp/c699e352-5260-4ae4-9fc4-287ebbbcda45.sh: line 5: aws: command not found 2025-12-04T12:46:52.8678876Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T12:46:52.8687963Z + sleep 1 2025-12-04T12:46:53.8699962Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:53.8702797Z + aws ecr get-login-password --region us-east-1 2025-12-04T12:46:53.8703333Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:53.8703973Z /home/runner/_work/_temp/c699e352-5260-4ae4-9fc4-287ebbbcda45.sh: line 5: aws: command not found 2025-12-04T12:46:53.8794237Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T12:46:53.8807461Z + sleep 2 2025-12-04T12:46:55.8817717Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:55.8820424Z + aws ecr get-login-password --region us-east-1 2025-12-04T12:46:55.8820941Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T12:46:55.8821562Z /home/runner/_work/_temp/c699e352-5260-4ae4-9fc4-287ebbbcda45.sh: line 5: aws: command not found 2025-12-04T12:46:55.8914708Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T12:46:55.8929134Z ++ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:55.8929780Z ++ jq '[.layers[].size, .config.size] | add / 1024 / 1024' 2025-12-04T12:46:57.2191886Z + IMAGE_SIZE=18171.470620155334 2025-12-04T12:46:57.2192429Z + echo 'Compressed size of image in MB: 18171.470620155334' 2025-12-04T12:46:57.2192854Z + set -e 2025-12-04T12:46:57.2193666Z + docker inspect --type=image 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:46:57.2194562Z Compressed size of image in MB: 18171.470620155334 
2025-12-04T12:46:57.2334629Z Prepare all required actions 2025-12-04T12:46:57.2348472Z ##[group]Run ./.github/actions/get-workflow-job-id 2025-12-04T12:46:57.2348611Z with: 2025-12-04T12:46:57.2348853Z github-token: *** 2025-12-04T12:46:57.2348950Z env: 2025-12-04T12:46:57.2349044Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:57.2349183Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:57.2349379Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:57.2349666Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:57.2350175Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:57.2350663Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:57.2350811Z AWS_REGION: us-east-1 2025-12-04T12:46:57.2350969Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:57.2351122Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:57.2353229Z AWS_SESSION_TOKEN: *** 2025-12-04T12:46:57.2353333Z ##[endgroup] 2025-12-04T12:46:57.2359996Z ##[group]Run set -eux 2025-12-04T12:46:57.2360109Z set -eux 2025-12-04T12:46:57.2360276Z python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2025-12-04T12:46:57.2364750Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:46:57.2364896Z env: 2025-12-04T12:46:57.2364987Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:46:57.2365119Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:46:57.2365293Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:46:57.2365458Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:46:57.2365955Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 
--device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:46:57.2366438Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:46:57.2366556Z AWS_REGION: us-east-1 2025-12-04T12:46:57.2366696Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:46:57.2366853Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:46:57.2369086Z AWS_SESSION_TOKEN: *** 2025-12-04T12:46:57.2369242Z GITHUB_TOKEN: *** 2025-12-04T12:46:57.2369340Z ##[endgroup] 2025-12-04T12:46:57.2384587Z + python3 .github/scripts/get_workflow_job_id.py 19921726347 linux.rocm.gpu.gfx942.4.b-bphpw-runner-qmdl8 2025-12-04T12:46:58.2997381Z Setting output job-id=57113808223 2025-12-04T12:46:58.2997936Z Setting output job-name=linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4.b, unstable) 2025-12-04T12:46:58.3093958Z Prepare all required actions 2025-12-04T12:46:58.3094175Z Getting action download info 2025-12-04T12:46:58.5171700Z Download action repository 'seemethere/download-artifact-s3@v4' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6) 2025-12-04T12:46:59.3611696Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093) 2025-12-04T12:47:00.1833559Z ##[group]Run ./.github/actions/download-build-artifacts 2025-12-04T12:47:00.1833716Z with: 2025-12-04T12:47:00.1833816Z name: linux-jammy-rocm-py3.10 2025-12-04T12:47:00.1833949Z s3-bucket: gha-artifacts 2025-12-04T12:47:00.1834060Z env: 2025-12-04T12:47:00.1834155Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:00.1834291Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:00.1834466Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:00.1834639Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:00.1835170Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device 
/dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:00.1835660Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:00.1835777Z AWS_REGION: us-east-1 2025-12-04T12:47:00.1835945Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:00.1836103Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:00.1838260Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:00.1838501Z ##[endgroup] 2025-12-04T12:47:00.1851586Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T12:47:00.1851723Z with: 2025-12-04T12:47:00.1851823Z name: linux-jammy-rocm-py3.10 2025-12-04T12:47:00.1851949Z s3-bucket: gha-artifacts 2025-12-04T12:47:00.1852061Z region: us-east-1 2025-12-04T12:47:00.1852159Z env: 2025-12-04T12:47:00.1852251Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:00.1852389Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:00.1852569Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:00.1852736Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:00.1853235Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:00.1853724Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:00.1853839Z AWS_REGION: us-east-1 2025-12-04T12:47:00.1853973Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:00.1854127Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:00.1856218Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:00.1856327Z ##[endgroup] 2025-12-04T12:47:00.4052858Z (node:17071) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 
2025-12-04T12:47:00.4053155Z 2025-12-04T12:47:00.4053280Z Please migrate your code to use AWS SDK for JavaScript (v3). 2025-12-04T12:47:00.4053603Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T12:47:00.4053921Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T12:47:00.6883773Z Found 1 objects with prefix pytorch/pytorch/19921726347/linux-jammy-rocm-py3.10/ 2025-12-04T12:47:00.6884303Z Starting download (1/1): /home/runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T12:47:34.9544676Z Finished download (1/1): /home/runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T12:47:34.9547961Z Artifact download has finished successfully 2025-12-04T12:47:34.9668253Z ##[group]Run unzip -o artifacts.zip 2025-12-04T12:47:34.9668406Z unzip -o artifacts.zip 2025-12-04T12:47:34.9672744Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:47:34.9672893Z env: 2025-12-04T12:47:34.9673148Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:34.9673287Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:34.9673465Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:34.9673632Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:34.9674131Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:34.9674629Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:34.9674744Z AWS_REGION: us-east-1 2025-12-04T12:47:34.9674907Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:34.9675062Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:34.9677155Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:34.9677262Z ##[endgroup] 2025-12-04T12:47:34.9709312Z Archive: artifacts.zip 2025-12-04T12:47:34.9710549Z 
creating: dist/ 2025-12-04T12:47:37.8630748Z inflating: dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T12:47:37.8709582Z inflating: dist/.ninja_log 2025-12-04T12:47:37.8709749Z creating: build/custom_test_artifacts/ 2025-12-04T12:47:37.8714343Z creating: build/custom_test_artifacts/custom-op-build/ 2025-12-04T12:47:37.8714890Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/ 2025-12-04T12:47:37.8715416Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/pkgRedirects/ 2025-12-04T12:47:37.8716497Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T12:47:37.8717084Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/ 2025-12-04T12:47:37.8717807Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T12:47:37.8718400Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T12:47:37.8719035Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T12:47:37.8719705Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T12:47:37.8720375Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T12:47:37.8721011Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T12:47:37.8721642Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T12:47:37.8722123Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T12:47:37.8722663Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T12:47:37.8723215Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 
2025-12-04T12:47:37.8723729Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T12:47:37.8724297Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T12:47:37.8724878Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T12:47:37.8725377Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeScratch/ 2025-12-04T12:47:37.8725787Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeTmp/ 2025-12-04T12:47:37.8726224Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/cmake.check_cache 2025-12-04T12:47:37.8726668Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/ 2025-12-04T12:47:37.8727153Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.ts 2025-12-04T12:47:37.8727991Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.make 2025-12-04T12:47:37.8728516Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/depend.make 2025-12-04T12:47:37.8729007Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/link.txt 2025-12-04T12:47:37.8729510Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/cmake_clean.cmake 2025-12-04T12:47:37.8730022Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/build.make 2025-12-04T12:47:37.8730540Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/DependInfo.cmake 2025-12-04T12:47:37.8731047Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/flags.make 2025-12-04T12:47:37.8731464Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/progress.make 2025-12-04T12:47:37.8735785Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o.d 2025-12-04T12:47:37.8851544Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o 2025-12-04T12:47:37.8851897Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/ 2025-12-04T12:47:37.8852272Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.ts 2025-12-04T12:47:37.8852666Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.make 2025-12-04T12:47:37.8853131Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/depend.make 2025-12-04T12:47:37.8853487Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/link.txt 2025-12-04T12:47:37.8853848Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/cmake_clean.cmake 2025-12-04T12:47:37.8854212Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/build.make 2025-12-04T12:47:37.8854584Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/DependInfo.cmake 2025-12-04T12:47:37.8854945Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/flags.make 2025-12-04T12:47:37.8855298Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/progress.make 2025-12-04T12:47:37.8865955Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o.d 2025-12-04T12:47:37.8913355Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o 2025-12-04T12:47:37.8913718Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T12:47:37.8914033Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/TargetDirectories.txt 2025-12-04T12:47:37.8914314Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/progress.marks 2025-12-04T12:47:37.8914576Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile2 2025-12-04T12:47:37.8914833Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile.cmake 2025-12-04T12:47:37.8915103Z inflating: build/custom_test_artifacts/custom-op-build/hipblaslt_test_outer_vec.cc 2025-12-04T12:47:37.8915358Z inflating: build/custom_test_artifacts/custom-op-build/hipblaslt_test_vec_ext.cc 2025-12-04T12:47:37.8916227Z inflating: build/custom_test_artifacts/custom-op-build/CMakeCache.txt 2025-12-04T12:47:37.8916880Z inflating: build/custom_test_artifacts/custom-op-build/Makefile 2025-12-04T12:47:37.8917277Z inflating: build/custom_test_artifacts/custom-op-build/cmake_install.cmake 2025-12-04T12:47:37.9017646Z inflating: build/custom_test_artifacts/custom-op-build/libcustom_ops.so 2025-12-04T12:47:37.9051611Z inflating: build/custom_test_artifacts/custom-op-build/test_custom_ops 2025-12-04T12:47:37.9051902Z creating: build/custom_test_artifacts/jit-hook-build/ 2025-12-04T12:47:37.9052171Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/ 2025-12-04T12:47:37.9052493Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/pkgRedirects/ 2025-12-04T12:47:37.9054038Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T12:47:37.9054395Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/ 2025-12-04T12:47:37.9054758Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T12:47:37.9055132Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T12:47:37.9055495Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 
2025-12-04T12:47:37.9055936Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T12:47:37.9056593Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T12:47:37.9056988Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T12:47:37.9057371Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T12:47:37.9057770Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T12:47:37.9058624Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T12:47:37.9059524Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T12:47:37.9059902Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T12:47:37.9060863Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T12:47:37.9061457Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T12:47:37.9061813Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeScratch/ 2025-12-04T12:47:37.9062105Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeTmp/ 2025-12-04T12:47:37.9062414Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/cmake.check_cache 2025-12-04T12:47:37.9062733Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/ 2025-12-04T12:47:37.9063091Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.ts 2025-12-04T12:47:37.9063487Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.make 2025-12-04T12:47:37.9063882Z inflating: 
build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/depend.make 2025-12-04T12:47:37.9064239Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/link.txt 2025-12-04T12:47:37.9064616Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/cmake_clean.cmake 2025-12-04T12:47:37.9064990Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/build.make 2025-12-04T12:47:37.9065355Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/DependInfo.cmake 2025-12-04T12:47:37.9065758Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/flags.make 2025-12-04T12:47:37.9066121Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/progress.make 2025-12-04T12:47:37.9075996Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o.d 2025-12-04T12:47:37.9112642Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o 2025-12-04T12:47:37.9113053Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T12:47:37.9113360Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/TargetDirectories.txt 2025-12-04T12:47:37.9113624Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/progress.marks 2025-12-04T12:47:37.9113866Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile2 2025-12-04T12:47:37.9114279Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile.cmake 2025-12-04T12:47:37.9114529Z inflating: build/custom_test_artifacts/jit-hook-build/hipblaslt_test_outer_vec.cc 2025-12-04T12:47:37.9114771Z inflating: build/custom_test_artifacts/jit-hook-build/hipblaslt_test_vec_ext.cc 2025-12-04T12:47:37.9115669Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeCache.txt 
2025-12-04T12:47:37.9115931Z inflating: build/custom_test_artifacts/jit-hook-build/Makefile 2025-12-04T12:47:37.9116189Z inflating: build/custom_test_artifacts/jit-hook-build/cmake_install.cmake 2025-12-04T12:47:37.9139017Z inflating: build/custom_test_artifacts/jit-hook-build/test_jit_hooks 2025-12-04T12:47:37.9139219Z creating: build/custom_test_artifacts/custom-backend-build/ 2025-12-04T12:47:37.9139520Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/ 2025-12-04T12:47:37.9139752Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/pkgRedirects/ 2025-12-04T12:47:37.9141849Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T12:47:37.9142113Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/ 2025-12-04T12:47:37.9142369Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T12:47:37.9142649Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T12:47:37.9142920Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T12:47:37.9143842Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T12:47:37.9144532Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T12:47:37.9144881Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T12:47:37.9145174Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T12:47:37.9145447Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T12:47:37.9146446Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 
2025-12-04T12:47:37.9147181Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T12:47:37.9147536Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T12:47:37.9148558Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T12:47:37.9149171Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T12:47:37.9149476Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeScratch/ 2025-12-04T12:47:37.9149720Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeTmp/ 2025-12-04T12:47:37.9149978Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/cmake.check_cache 2025-12-04T12:47:37.9150242Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/ 2025-12-04T12:47:37.9150674Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.ts 2025-12-04T12:47:37.9151002Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.make 2025-12-04T12:47:37.9151327Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/depend.make 2025-12-04T12:47:37.9151623Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/link.txt 2025-12-04T12:47:37.9151932Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/cmake_clean.cmake 2025-12-04T12:47:37.9152247Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/build.make 2025-12-04T12:47:37.9152558Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/DependInfo.cmake 2025-12-04T12:47:37.9152878Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/flags.make 2025-12-04T12:47:37.9153179Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/progress.make 2025-12-04T12:47:37.9154179Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o.d 2025-12-04T12:47:37.9223533Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o 2025-12-04T12:47:37.9223840Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/ 2025-12-04T12:47:37.9224211Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.ts 2025-12-04T12:47:37.9224555Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.make 2025-12-04T12:47:37.9224888Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/depend.make 2025-12-04T12:47:37.9225205Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/link.txt 2025-12-04T12:47:37.9225531Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/cmake_clean.cmake 2025-12-04T12:47:37.9225861Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/build.make 2025-12-04T12:47:37.9226184Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/DependInfo.cmake 2025-12-04T12:47:37.9226510Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/flags.make 2025-12-04T12:47:37.9226824Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/progress.make 2025-12-04T12:47:37.9238101Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o.d 2025-12-04T12:47:37.9269968Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o 2025-12-04T12:47:37.9270314Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T12:47:37.9270614Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/TargetDirectories.txt 2025-12-04T12:47:37.9270888Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/progress.marks 2025-12-04T12:47:37.9271144Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile2 2025-12-04T12:47:37.9271521Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile.cmake 2025-12-04T12:47:37.9271842Z inflating: build/custom_test_artifacts/custom-backend-build/hipblaslt_test_outer_vec.cc 2025-12-04T12:47:37.9272107Z inflating: build/custom_test_artifacts/custom-backend-build/hipblaslt_test_vec_ext.cc 2025-12-04T12:47:37.9272956Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeCache.txt 2025-12-04T12:47:37.9273282Z inflating: build/custom_test_artifacts/custom-backend-build/Makefile 2025-12-04T12:47:37.9273513Z inflating: build/custom_test_artifacts/custom-backend-build/cmake_install.cmake 2025-12-04T12:47:37.9333250Z inflating: build/custom_test_artifacts/custom-backend-build/libcustom_backend.so 2025-12-04T12:47:37.9356426Z inflating: build/custom_test_artifacts/custom-backend-build/test_custom_backend 2025-12-04T12:47:37.9356615Z creating: build/lib/ 2025-12-04T12:47:37.9405310Z inflating: build/lib/libprotobuf-lite.a 2025-12-04T12:47:37.9667412Z inflating: build/lib/libprotobuf.a 2025-12-04T12:47:37.9960577Z inflating: build/lib/libprotoc.a 2025-12-04T12:47:37.9966189Z inflating: build/lib/libpthreadpool.a 2025-12-04T12:47:37.9970875Z inflating: build/lib/libcpuinfo.a 
2025-12-04T12:47:37.9975323Z inflating: build/lib/libcpuinfo_internals.a 2025-12-04T12:47:37.9975708Z inflating: build/lib/libclog.a 2025-12-04T12:47:37.9987060Z inflating: build/lib/libpytorch_qnnpack.a 2025-12-04T12:47:37.9988262Z inflating: build/lib/libnnpack_reference_layers.a 2025-12-04T12:47:37.9998648Z inflating: build/lib/libnnpack.a 2025-12-04T12:47:38.0109491Z inflating: build/lib/libmicrokernels-prod.a 2025-12-04T12:47:38.0631797Z inflating: build/lib/libmicrokernels-all.a 2025-12-04T12:47:38.0673094Z inflating: build/lib/libgtest.a 2025-12-04T12:47:38.0683356Z inflating: build/lib/libgmock.a 2025-12-04T12:47:38.0683700Z inflating: build/lib/libgtest_main.a 2025-12-04T12:47:38.0683894Z inflating: build/lib/libgmock_main.a 2025-12-04T12:47:38.0737769Z inflating: build/lib/libXNNPACK.a 2025-12-04T12:47:38.0782999Z inflating: build/lib/libbenchmark.a 2025-12-04T12:47:38.0783224Z inflating: build/lib/libbenchmark_main.a 2025-12-04T12:47:38.0787944Z inflating: build/lib/libittnotify.a 2025-12-04T12:47:38.0788178Z inflating: build/lib/libjitprofiling.a 2025-12-04T12:47:38.0827697Z inflating: build/lib/libasmjit.a 2025-12-04T12:47:38.1514948Z inflating: build/lib/libfbgemm.a 2025-12-04T12:47:38.1532956Z inflating: build/lib/libtensorpipe_uv.a 2025-12-04T12:47:38.1853565Z inflating: build/lib/libtensorpipe.a 2025-12-04T12:47:38.1925240Z inflating: build/lib/libgloo.a 2025-12-04T12:47:38.1952887Z inflating: build/lib/libonnx_proto.a 2025-12-04T12:47:38.2200796Z inflating: build/lib/libgloo_hip.a 2025-12-04T12:47:38.2621978Z inflating: build/lib/libonnx.a 2025-12-04T12:47:38.8608737Z inflating: build/lib/libdnnl.a 2025-12-04T12:47:38.8620043Z inflating: build/lib/libfmt.a 2025-12-04T12:47:38.8805002Z inflating: build/lib/libkineto.a 2025-12-04T12:47:38.8874705Z inflating: build/lib/libc10.so 2025-12-04T12:47:38.8875362Z inflating: build/lib/libtorch_global_deps.so 2025-12-04T12:47:38.8876271Z inflating: build/lib/libcaffe2_nvrtc.so 2025-12-04T12:47:38.8902991Z 
inflating: build/lib/libc10_hip.so 2025-12-04T12:47:38.9191474Z inflating: build/lib/libfbgemm_genai.a 2025-12-04T12:47:40.7517460Z inflating: build/lib/libtorch_cpu.so 2025-12-04T12:47:40.7519717Z inflating: build/lib/libshm.so 2025-12-04T12:47:41.5946039Z inflating: build/lib/libtorch_hip.so 2025-12-04T12:47:41.5946506Z inflating: build/lib/libtorch.so 2025-12-04T12:47:41.5957669Z inflating: build/lib/libjitbackend_test.so 2025-12-04T12:47:41.5999760Z inflating: build/lib/libtorchbind_test.so 2025-12-04T12:47:41.6015321Z inflating: build/lib/libbackend_with_compiler.so 2025-12-04T12:47:41.6029809Z inflating: build/lib/libaoti_custom_ops.so 2025-12-04T12:47:41.7434422Z inflating: build/lib/libtorch_python.so 2025-12-04T12:47:41.7455984Z inflating: build/lib/libnnapi_backend.so 2025-12-04T12:47:41.7456325Z creating: build/bin/ 2025-12-04T12:47:41.7456579Z creating: build/bin/CMakeFiles/ 2025-12-04T12:47:41.7456873Z inflating: build/bin/cmake_install.cmake 2025-12-04T12:47:41.7457907Z inflating: build/bin/CTestTestfile.cmake 2025-12-04T12:47:41.7730259Z inflating: build/bin/protoc-3.13.0.0 2025-12-04T12:47:41.8003833Z inflating: build/bin/protoc 2025-12-04T12:47:41.8039202Z inflating: build/bin/c10_AllocatorConfig_test 2025-12-04T12:47:41.8072767Z inflating: build/bin/c10_CompileTimeFunctionPointer_test 2025-12-04T12:47:41.8106901Z inflating: build/bin/c10_Device_test 2025-12-04T12:47:41.8145849Z inflating: build/bin/c10_DispatchKeySet_test 2025-12-04T12:47:41.8179977Z inflating: build/bin/c10_DeviceGuard_test 2025-12-04T12:47:41.8215430Z inflating: build/bin/c10_Scalar_test 2025-12-04T12:47:41.8248503Z inflating: build/bin/c10_StreamGuard_test 2025-12-04T12:47:41.8284181Z inflating: build/bin/c10_InlineDeviceGuard_test 2025-12-04T12:47:41.8321194Z inflating: build/bin/c10_SizesAndStrides_test 2025-12-04T12:47:41.8357650Z inflating: build/bin/c10_InlineStreamGuard_test 2025-12-04T12:47:41.8394756Z inflating: build/bin/c10_SymInt_test 2025-12-04T12:47:41.8427649Z 
inflating: build/bin/c10_ArrayRef_test 2025-12-04T12:47:41.8473222Z inflating: build/bin/c10_cow_test 2025-12-04T12:47:41.8505791Z inflating: build/bin/c10_ConstexprCrc_test 2025-12-04T12:47:41.8540876Z inflating: build/bin/c10_Bitset_test 2025-12-04T12:47:41.8573983Z inflating: build/bin/c10_DeadlockDetection_test 2025-12-04T12:47:41.8608755Z inflating: build/bin/c10_IntrusiveList_test 2025-12-04T12:47:41.8642418Z inflating: build/bin/c10_Half_test 2025-12-04T12:47:41.8679288Z inflating: build/bin/c10_LeftRight_test 2025-12-04T12:47:41.8716774Z inflating: build/bin/c10_Enumerate_test 2025-12-04T12:47:41.8751953Z inflating: build/bin/c10_NetworkFlow_test 2025-12-04T12:47:41.8785241Z inflating: build/bin/c10_Semaphore_test 2025-12-04T12:47:41.8821783Z inflating: build/bin/c10_ThreadLocal_test 2025-12-04T12:47:41.8855062Z inflating: build/bin/c10_Synchronized_test 2025-12-04T12:47:41.8889325Z inflating: build/bin/c10_TypeIndex_test 2025-12-04T12:47:41.8923592Z inflating: build/bin/c10_accumulate_test 2025-12-04T12:47:41.8956864Z inflating: build/bin/c10_bit_cast_test 2025-12-04T12:47:41.8993679Z inflating: build/bin/c10_bfloat16_test 2025-12-04T12:47:41.9029999Z inflating: build/bin/c10_complex_test 2025-12-04T12:47:41.9067216Z inflating: build/bin/c10_complex_math_test 2025-12-04T12:47:41.9100283Z inflating: build/bin/c10_error_test 2025-12-04T12:47:41.9134864Z inflating: build/bin/c10_exception_test 2025-12-04T12:47:41.9168268Z inflating: build/bin/c10_flags_test 2025-12-04T12:47:41.9201509Z inflating: build/bin/c10_generic_math_test 2025-12-04T12:47:41.9299582Z inflating: build/bin/c10_intrusive_ptr_test 2025-12-04T12:47:41.9333458Z inflating: build/bin/c10_irange_test 2025-12-04T12:47:41.9368733Z inflating: build/bin/c10_lazy_test 2025-12-04T12:47:41.9401923Z inflating: build/bin/c10_nofatal_test 2025-12-04T12:47:41.9450671Z inflating: build/bin/c10_optional_test 2025-12-04T12:47:41.9488390Z inflating: build/bin/c10_logging_test 2025-12-04T12:47:41.9528787Z 
inflating: build/bin/c10_ordered_preserving_dict_test 2025-12-04T12:47:41.9624862Z inflating: build/bin/c10_small_vector_test 2025-12-04T12:47:41.9660108Z inflating: build/bin/c10_registry_test 2025-12-04T12:47:41.9694099Z inflating: build/bin/c10_ssize_test 2025-12-04T12:47:41.9731376Z inflating: build/bin/c10_string_util_test 2025-12-04T12:47:41.9763869Z inflating: build/bin/c10_string_view_test 2025-12-04T12:47:41.9792899Z inflating: build/bin/c10_intrusive_ptr_benchmark 2025-12-04T12:47:41.9826144Z inflating: build/bin/c10_tempfile_test 2025-12-04T12:47:41.9863259Z inflating: build/bin/c10_typeid_test 2025-12-04T12:47:41.9895865Z inflating: build/bin/c10_hip_HIPAssertionsTest_1_var_test 2025-12-04T12:47:41.9928285Z inflating: build/bin/c10_hip_HIPAssertionsTest_catches_stream 2025-12-04T12:47:41.9960922Z inflating: build/bin/c10_hip_HIPAssertionsTest_catches_thread_and_block_and_device 2025-12-04T12:47:41.9993276Z inflating: build/bin/c10_hip_HIPAssertionsTest_from_2_processes 2025-12-04T12:47:42.0025763Z inflating: build/bin/c10_hip_HIPAssertionsTest_multiple_writes_from_blocks_and_threads 2025-12-04T12:47:42.0058195Z inflating: build/bin/c10_hip_HIPAssertionsTest_multiple_writes_from_multiple_blocks 2025-12-04T12:47:42.0090570Z inflating: build/bin/c10_hip_HIPAssertionsTest_multiple_writes_from_same_block 2025-12-04T12:47:42.0123218Z inflating: build/bin/c10_hip_HIPTest 2025-12-04T12:47:42.0481161Z inflating: build/bin/vec_test_all_types_DEFAULT 2025-12-04T12:47:42.0849770Z inflating: build/bin/vec_test_all_types_AVX512 2025-12-04T12:47:42.1221173Z inflating: build/bin/vec_test_all_types_AVX2 2025-12-04T12:47:42.1283570Z inflating: build/bin/test_aoti_abi_check 2025-12-04T12:47:42.1316197Z inflating: build/bin/test_vec_half_DEFAULT 2025-12-04T12:47:42.1349174Z inflating: build/bin/test_vec_half_AVX2 2025-12-04T12:47:42.1382222Z inflating: build/bin/test_vec_half_AVX512 2025-12-04T12:47:42.1417027Z inflating: build/bin/BackoffTest 2025-12-04T12:47:42.1452026Z 
inflating: build/bin/FileStoreTest 2025-12-04T12:47:42.1489198Z inflating: build/bin/TCPStoreTest 2025-12-04T12:47:42.1524766Z inflating: build/bin/HashStoreTest 2025-12-04T12:47:42.1568478Z inflating: build/bin/ProcessGroupGlooTest 2025-12-04T12:47:42.1570172Z inflating: build/bin/example_allreduce 2025-12-04T12:47:42.1572084Z inflating: build/bin/torch_shm_manager 2025-12-04T12:47:42.1607900Z inflating: build/bin/static_runtime_bench 2025-12-04T12:47:42.1764127Z inflating: build/bin/static_runtime_test 2025-12-04T12:47:42.1811515Z inflating: build/bin/Dict_test 2025-12-04T12:47:42.1846192Z inflating: build/bin/Dimname_test 2025-12-04T12:47:42.1888496Z inflating: build/bin/MaybeOwned_test 2025-12-04T12:47:42.1925751Z inflating: build/bin/NamedTensor_test 2025-12-04T12:47:42.1964284Z inflating: build/bin/apply_utils_test 2025-12-04T12:47:42.2002812Z inflating: build/bin/atest 2025-12-04T12:47:42.2044403Z inflating: build/bin/basic 2025-12-04T12:47:42.2080133Z inflating: build/bin/broadcast_test 2025-12-04T12:47:42.2113628Z inflating: build/bin/cpu_allocator_test 2025-12-04T12:47:42.2151633Z inflating: build/bin/cpu_generator_test 2025-12-04T12:47:42.2186436Z inflating: build/bin/cpu_profiling_allocator_test 2025-12-04T12:47:42.2245485Z inflating: build/bin/cpu_rng_test 2025-12-04T12:47:42.2279717Z inflating: build/bin/dlconvertor_test 2025-12-04T12:47:42.2317775Z inflating: build/bin/extension_backend_test 2025-12-04T12:47:42.2354175Z inflating: build/bin/half_test 2025-12-04T12:47:42.2416629Z inflating: build/bin/ivalue_test 2025-12-04T12:47:42.2449556Z inflating: build/bin/lazy_tensor_test 2025-12-04T12:47:42.2484535Z inflating: build/bin/math_kernel_test 2025-12-04T12:47:42.2519524Z inflating: build/bin/memory_format_test 2025-12-04T12:47:42.2554835Z inflating: build/bin/memory_overlapping_test 2025-12-04T12:47:42.2590257Z inflating: build/bin/mobile_memory_cleanup 2025-12-04T12:47:42.2627046Z inflating: build/bin/native_test 2025-12-04T12:47:42.2660685Z 
inflating: build/bin/operator_name_test 2025-12-04T12:47:42.2694168Z inflating: build/bin/operators_test 2025-12-04T12:47:42.2728717Z inflating: build/bin/packedtensoraccessor_test 2025-12-04T12:47:42.2772788Z inflating: build/bin/pow_test 2025-12-04T12:47:42.2809797Z inflating: build/bin/quantized_test 2025-12-04T12:47:42.2843085Z inflating: build/bin/reduce_ops_test 2025-12-04T12:47:42.2877009Z inflating: build/bin/reportMemoryUsage_test 2025-12-04T12:47:42.2914179Z inflating: build/bin/scalar_tensor_test 2025-12-04T12:47:42.2951602Z inflating: build/bin/scalar_test 2025-12-04T12:47:42.2985563Z inflating: build/bin/StorageUtils_test 2025-12-04T12:47:42.3019876Z inflating: build/bin/stride_properties_test 2025-12-04T12:47:42.3071497Z inflating: build/bin/tensor_iterator_test 2025-12-04T12:47:42.3106987Z inflating: build/bin/test_parallel 2025-12-04T12:47:42.3140664Z inflating: build/bin/thread_init_test 2025-12-04T12:47:42.3176844Z inflating: build/bin/type_ptr_test 2025-12-04T12:47:42.3215698Z inflating: build/bin/type_test 2025-12-04T12:47:42.3250208Z inflating: build/bin/undefined_tensor_test 2025-12-04T12:47:42.3283102Z inflating: build/bin/verify_api_visibility 2025-12-04T12:47:42.3329247Z inflating: build/bin/legacy_vmap_test 2025-12-04T12:47:42.3363003Z inflating: build/bin/weakref_test 2025-12-04T12:47:42.3396905Z inflating: build/bin/wrapdim_test 2025-12-04T12:47:42.3430708Z inflating: build/bin/xla_tensor_test 2025-12-04T12:47:42.3469653Z inflating: build/bin/IListRef_test 2025-12-04T12:47:42.3536542Z inflating: build/bin/List_test 2025-12-04T12:47:42.3612762Z inflating: build/bin/kernel_function_legacy_test 2025-12-04T12:47:42.3655667Z inflating: build/bin/KernelFunction_test 2025-12-04T12:47:42.3720531Z inflating: build/bin/kernel_lambda_test 2025-12-04T12:47:42.3800202Z inflating: build/bin/kernel_lambda_legacy_test 2025-12-04T12:47:42.3861488Z inflating: build/bin/kernel_function_test 2025-12-04T12:47:42.3900872Z inflating: 
build/bin/kernel_stackbased_test 2025-12-04T12:47:42.3961622Z inflating: build/bin/make_boxed_from_unboxed_functor_test 2025-12-04T12:47:42.3995397Z inflating: build/bin/CppSignature_test 2025-12-04T12:47:42.4031576Z inflating: build/bin/backend_fallback_test 2025-12-04T12:47:42.4064048Z inflating: build/bin/op_allowlist_test 2025-12-04T12:47:42.4254989Z inflating: build/bin/op_registration_test 2025-12-04T12:47:42.4298461Z inflating: build/bin/inline_container_test 2025-12-04T12:47:42.4330875Z inflating: build/bin/hip_complex_math_test 2025-12-04T12:47:42.4363374Z inflating: build/bin/hip_complex_test 2025-12-04T12:47:42.4398515Z inflating: build/bin/hip_apply_test 2025-12-04T12:47:42.4430950Z inflating: build/bin/hip_distributions_test 2025-12-04T12:47:42.4463382Z inflating: build/bin/hip_generator_test 2025-12-04T12:47:42.4495846Z inflating: build/bin/hip_half_test 2025-12-04T12:47:42.4528202Z inflating: build/bin/hip_integer_divider_test 2025-12-04T12:47:42.4560802Z inflating: build/bin/hip_optional_test 2025-12-04T12:47:42.4593113Z inflating: build/bin/hip_packedtensoraccessor_test 2025-12-04T12:47:42.4625917Z inflating: build/bin/hip_vectorized_test 2025-12-04T12:47:42.4660146Z inflating: build/bin/hip_dlconvertor_test 2025-12-04T12:47:42.5332313Z inflating: build/bin/test_jit 2025-12-04T12:47:42.5545890Z inflating: build/bin/test_lazy 2025-12-04T12:47:42.5582310Z inflating: build/bin/test_dist_autograd 2025-12-04T12:47:42.5627032Z inflating: build/bin/test_cpp_rpc 2025-12-04T12:47:42.6347730Z inflating: build/bin/test_api 2025-12-04T12:47:42.6348435Z inflating: build/bin/parallel_benchmark 2025-12-04T12:47:42.6348857Z creating: .additional_ci_files/ 2025-12-04T12:47:42.6387260Z inflating: .additional_ci_files/test-times.json 2025-12-04T12:47:42.6528349Z inflating: .additional_ci_files/test-class-times.json 2025-12-04T12:47:42.6556679Z ##[group]Run rm artifacts.zip 2025-12-04T12:47:42.6556897Z rm artifacts.zip 2025-12-04T12:47:42.6562290Z shell: /usr/bin/bash 
--noprofile --norc -e -o pipefail {0} 2025-12-04T12:47:42.6562479Z env: 2025-12-04T12:47:42.6562594Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:42.6562762Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:42.6562992Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:42.6563196Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:42.6563948Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:42.6564545Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:42.6564690Z AWS_REGION: us-east-1 2025-12-04T12:47:42.6564921Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:42.6565126Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:42.6567761Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:42.6567903Z ##[endgroup] 2025-12-04T12:47:42.7570665Z ##[group]Run df -H 2025-12-04T12:47:42.7570767Z df -H 2025-12-04T12:47:42.7573389Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:47:42.7573547Z env: 2025-12-04T12:47:42.7573642Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:42.7573775Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:42.7573955Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:42.7574121Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:42.7574624Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:42.7575119Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:42.7575242Z AWS_REGION: 
us-east-1 2025-12-04T12:47:42.7575383Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:42.7575535Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:42.7577664Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:42.7577897Z ##[endgroup] 2025-12-04T12:47:42.7984187Z Filesystem Size Used Avail Use% Mounted on 2025-12-04T12:47:42.7984661Z overlay 16T 349G 16T 3% / 2025-12-04T12:47:42.7985019Z tmpfs 68M 0 68M 0% /dev 2025-12-04T12:47:42.7985363Z /dev/md0 16T 349G 16T 3% /run 2025-12-04T12:47:42.7985895Z shm 68M 17k 68M 1% /dev/shm 2025-12-04T12:47:42.7986533Z amdprj2-k8s_2 5.5T 120G 5.4T 3% /home/runner/pytorch-data 2025-12-04T12:47:42.7987060Z tmpfs 3.3T 13k 3.3T 1% /run/secrets/kubernetes.io/serviceaccount 2025-12-04T12:47:42.7987575Z tmpfs 1.7T 0 1.7T 0% /proc/acpi 2025-12-04T12:47:42.7987940Z tmpfs 1.7T 0 1.7T 0% /proc/scsi 2025-12-04T12:47:42.7988303Z tmpfs 1.7T 0 1.7T 0% /sys/firmware 2025-12-04T12:47:42.7988717Z tmpfs 1.7T 0 1.7T 0% /sys/devices/virtual/powercap 2025-12-04T12:47:42.8017042Z Prepare all required actions 2025-12-04T12:47:42.8017288Z Getting action download info 2025-12-04T12:47:43.0341540Z ##[group]Run ./.github/actions/download-td-artifacts 2025-12-04T12:47:43.0341689Z with: 2025-12-04T12:47:43.0341787Z env: 2025-12-04T12:47:43.0341884Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:43.0342030Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:43.0342215Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:43.0342387Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:43.0342900Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:43.0343396Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:43.0343568Z AWS_REGION: us-east-1 
2025-12-04T12:47:43.0343750Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:43.0343915Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:43.0346037Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:43.0346148Z ##[endgroup] 2025-12-04T12:47:43.0359196Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T12:47:43.0359343Z with: 2025-12-04T12:47:43.0359444Z name: td_results 2025-12-04T12:47:43.0359557Z s3-bucket: gha-artifacts 2025-12-04T12:47:43.0359673Z region: us-east-1 2025-12-04T12:47:43.0359778Z env: 2025-12-04T12:47:43.0359878Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:43.0360022Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:43.0360204Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:43.0360378Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:43.0360894Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:43.0361397Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:43.0361527Z AWS_REGION: us-east-1 2025-12-04T12:47:43.0361669Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:43.0361826Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:43.0363940Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:43.0364055Z ##[endgroup] 2025-12-04T12:47:43.2551310Z (node:17106) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2025-12-04T12:47:43.2551832Z 2025-12-04T12:47:43.2552063Z Please migrate your code to use AWS SDK for JavaScript (v3). 
2025-12-04T12:47:43.2552633Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T12:47:43.2553217Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T12:47:43.5218861Z Found 1 objects with prefix pytorch/pytorch/19921726347/td_results/ 2025-12-04T12:47:43.5219398Z Starting download (1/1): /home/runner/_work/pytorch/pytorch/td_results.json 2025-12-04T12:47:43.9559780Z Finished download (1/1): /home/runner/_work/pytorch/pytorch/td_results.json 2025-12-04T12:47:43.9564241Z Artifact download has finished successfully 2025-12-04T12:47:43.9713706Z ##[group]Run mkdir -p .additional_ci_files 2025-12-04T12:47:43.9713952Z mkdir -p .additional_ci_files 2025-12-04T12:47:43.9714194Z mv td_results.json .additional_ci_files/td_results.json || true 2025-12-04T12:47:43.9719861Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:47:43.9720032Z env: 2025-12-04T12:47:43.9720143Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:43.9720307Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:43.9720511Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:43.9720700Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:43.9721415Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:43.9721969Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:43.9722100Z AWS_REGION: us-east-1 2025-12-04T12:47:43.9722382Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:43.9722569Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:43.9724869Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:43.9724993Z ##[endgroup] 2025-12-04T12:47:43.9793343Z ##[group]Run .github/scripts/parse_ref.py 2025-12-04T12:47:43.9793533Z 
.github/scripts/parse_ref.py 2025-12-04T12:47:43.9798710Z shell: /usr/bin/bash -e {0} 2025-12-04T12:47:43.9798825Z env: 2025-12-04T12:47:43.9798920Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:43.9799061Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:43.9799243Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:43.9799428Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:43.9799936Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:43.9800427Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:43.9800544Z AWS_REGION: us-east-1 2025-12-04T12:47:43.9800743Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:43.9800898Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:43.9803005Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:43.9803115Z ##[endgroup] 2025-12-04T12:47:43.9916886Z Setting output branch=main 2025-12-04T12:47:43.9986698Z Prepare all required actions 2025-12-04T12:47:43.9986918Z Getting action download info 2025-12-04T12:47:44.2461122Z ##[group]Run ./.github/actions/filter-test-configs 2025-12-04T12:47:44.2461278Z with: 2025-12-04T12:47:44.2461490Z github-token: *** 2025-12-04T12:47:44.2462677Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": 
"linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}]} 2025-12-04T12:47:44.2464137Z job-name: linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4.b, unstable) 2025-12-04T12:47:44.2464341Z env: 2025-12-04T12:47:44.2464442Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:44.2464587Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:44.2464767Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:44.2464936Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:44.2465445Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:44.2465941Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:44.2466069Z AWS_REGION: us-east-1 2025-12-04T12:47:44.2466200Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:44.2466357Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:44.2468549Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:44.2468658Z ##[endgroup] 2025-12-04T12:47:44.2483776Z ##[group]Run nick-fields/retry@v3.0.0 2025-12-04T12:47:44.2483903Z with: 2025-12-04T12:47:44.2483994Z shell: bash 2025-12-04T12:47:44.2484089Z timeout_minutes: 10 2025-12-04T12:47:44.2484192Z max_attempts: 5 2025-12-04T12:47:44.2484290Z retry_wait_seconds: 30 2025-12-04T12:47:44.2484576Z command: set -eux # PyYAML 
6.0 doesn't work with MacOS x86 anymore # This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2 python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T12:47:44.2484881Z polling_interval_seconds: 1 2025-12-04T12:47:44.2484994Z warning_on_retry: true 2025-12-04T12:47:44.2485100Z continue_on_error: false 2025-12-04T12:47:44.2485207Z env: 2025-12-04T12:47:44.2485299Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:44.2485439Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:44.2485616Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:44.2485780Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:44.2486274Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:44.2486757Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:44.2486874Z AWS_REGION: us-east-1 2025-12-04T12:47:44.2487003Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:44.2487303Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:44.2489436Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:44.2489584Z GITHUB_TOKEN: *** 2025-12-04T12:47:44.2489682Z ##[endgroup] 2025-12-04T12:47:44.2895127Z + python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T12:47:44.4298695Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T12:47:44.5262193Z Collecting requests==2.27.1 2025-12-04T12:47:44.5630972Z Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB) 2025-12-04T12:47:44.5729357Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.1/63.1 KB 7.0 MB/s eta 0:00:00 2025-12-04T12:47:44.6210220Z Collecting pyyaml==6.0.2 2025-12-04T12:47:44.6276097Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB) 
2025-12-04T12:47:44.6469391Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 KB 40.9 MB/s eta 0:00:00 2025-12-04T12:47:44.6663070Z Collecting idna<4,>=2.5 2025-12-04T12:47:44.6717803Z Downloading idna-3.11-py3-none-any.whl (71 kB) 2025-12-04T12:47:44.6732470Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.0/71.0 KB 109.0 MB/s eta 0:00:00 2025-12-04T12:47:44.7642145Z Collecting charset-normalizer~=2.0.0 2025-12-04T12:47:44.7706793Z Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB) 2025-12-04T12:47:44.7908499Z Collecting certifi>=2017.4.17 2025-12-04T12:47:44.8007333Z Downloading certifi-2025.11.12-py3-none-any.whl (159 kB) 2025-12-04T12:47:44.8023717Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 159.4/159.4 KB 232.1 MB/s eta 0:00:00 2025-12-04T12:47:44.8292429Z Collecting urllib3<1.27,>=1.21.1 2025-12-04T12:47:44.8390883Z Downloading urllib3-1.26.20-py2.py3-none-any.whl (144 kB) 2025-12-04T12:47:44.8407633Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 144.2/144.2 KB 220.8 MB/s eta 0:00:00 2025-12-04T12:47:44.8958575Z Installing collected packages: urllib3, pyyaml, idna, charset-normalizer, certifi, requests 2025-12-04T12:47:44.9880860Z WARNING: The script normalizer is installed in '/home/runner/.local/bin' which is not on PATH. 2025-12-04T12:47:44.9881749Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2025-12-04T12:47:45.0047768Z Successfully installed certifi-2025.11.12 charset-normalizer-2.0.12 idna-3.11 pyyaml-6.0.2 requests-2.27.1 urllib3-1.26.20 2025-12-04T12:47:45.2892927Z Command completed after 1 attempt(s). 
2025-12-04T12:47:45.2949450Z ##[group]Run set -x 2025-12-04T12:47:45.2949621Z set -x 2025-12-04T12:47:45.2949743Z  2025-12-04T12:47:45.2949939Z # Use relative path here as this could be checked out anywhere, not necessarily 2025-12-04T12:47:45.2950172Z # in runner workspace 2025-12-04T12:47:45.2950361Z python3 "${GITHUB_ACTION_PATH}/../../scripts/parse_ref.py" 2025-12-04T12:47:45.2955687Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:47:45.2955885Z env: 2025-12-04T12:47:45.2956017Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:45.2956197Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:45.2956431Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:45.2956640Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:45.2957271Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:45.2957947Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:45.2958096Z AWS_REGION: us-east-1 2025-12-04T12:47:45.2958311Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:45.2958516Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:45.2960820Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:45.2960935Z ##[endgroup] 2025-12-04T12:47:45.2984824Z + python3 /home/runner/_work/pytorch/pytorch/./.github/actions/filter-test-configs/../../scripts/parse_ref.py 2025-12-04T12:47:45.3074858Z Setting output branch=main 2025-12-04T12:47:45.3114243Z ##[group]Run echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T12:47:45.3114490Z echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T12:47:45.3114687Z echo "Job name: ${JOB_NAME}" 2025-12-04T12:47:45.3114850Z  2025-12-04T12:47:45.3115053Z # Use relative path here as this could be checked out anywhere, not necessarily 
2025-12-04T12:47:45.3115296Z # in runner workspace 2025-12-04T12:47:45.3115521Z python3 "${GITHUB_ACTION_PATH}/../../scripts/filter_test_configs.py" \ 2025-12-04T12:47:45.3115769Z  --workflow "${GITHUB_WORKFLOW}" \ 2025-12-04T12:47:45.3115948Z  --job-name "${JOB_NAME}" \ 2025-12-04T12:47:45.3117544Z  --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}]}" \ 2025-12-04T12:47:45.3119361Z  --selected-test-configs "" \ 2025-12-04T12:47:45.3119517Z  --pr-number "${PR_NUMBER}" \ 2025-12-04T12:47:45.3119655Z  --tag "${TAG}" \ 2025-12-04T12:47:45.3119787Z  --event-name "${EVENT_NAME}" \ 2025-12-04T12:47:45.3119922Z  --schedule "${SCHEDULE}" \ 2025-12-04T12:47:45.3120059Z  --branch "${HEAD_BRANCH}" 2025-12-04T12:47:45.3124748Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:47:45.3124908Z env: 2025-12-04T12:47:45.3125011Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:45.3125157Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 
2025-12-04T12:47:45.3125341Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:45.3125518Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:45.3126046Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:45.3126558Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:45.3126685Z AWS_REGION: us-east-1 2025-12-04T12:47:45.3126877Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:45.3127052Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:45.3129288Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:45.3129502Z GITHUB_TOKEN: *** 2025-12-04T12:47:45.3129688Z JOB_NAME: linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4.b, unstable) 2025-12-04T12:47:45.3129884Z PR_NUMBER: 2025-12-04T12:47:45.3129982Z TAG: 2025-12-04T12:47:45.3130074Z EVENT_NAME: push 2025-12-04T12:47:45.3130177Z SCHEDULE: 2025-12-04T12:47:45.3130273Z HEAD_BRANCH: main 2025-12-04T12:47:45.3130378Z ##[endgroup] 2025-12-04T12:47:45.3153641Z Workflow: trunk-rocm-mi300 2025-12-04T12:47:45.3154080Z Job name: linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4.b, unstable) 2025-12-04T12:47:46.0072474Z INFO:root:Issue https://github.com/pytorch/pytorch/issues/167616 created by jithunnair-amd has unstable all the test jobs for trunk-rocm-mi300 / linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4.b, unstable) 2025-12-04T12:47:46.0368718Z Setting output keep-going=True 2025-12-04T12:47:46.0369167Z Setting output ci-verbose-test-logs=False 2025-12-04T12:47:46.0369532Z Setting output ci-test-showlocals=False 2025-12-04T12:47:46.0369895Z Setting output ci-no-test-timeout=False 2025-12-04T12:47:46.0370254Z Setting output ci-no-td=False 
2025-12-04T12:47:46.0370593Z Setting output ci-td-distributed=False 2025-12-04T12:47:46.0370949Z Setting output is-unstable=True 2025-12-04T12:47:46.0371287Z Setting output reenabled-issues= 2025-12-04T12:47:46.0374565Z Setting output test-matrix={"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}]} 2025-12-04T12:47:46.0378443Z Setting output is-test-matrix-empty=False 2025-12-04T12:47:46.0458340Z ##[group]Run echo "Filtered matrix:" 2025-12-04T12:47:46.0458560Z echo "Filtered matrix:" 2025-12-04T12:47:46.0459946Z echo "{"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": 
"unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "unstable": "unstable"}]}" 2025-12-04T12:47:46.0461245Z  2025-12-04T12:47:46.0461350Z echo 2025-12-04T12:47:46.0461474Z echo "Is the current job unstable? True" 2025-12-04T12:47:46.0461625Z  2025-12-04T12:47:46.0461718Z echo 2025-12-04T12:47:46.0461835Z echo "Is keep-going label set? True" 2025-12-04T12:47:46.0461965Z  2025-12-04T12:47:46.0462063Z echo 2025-12-04T12:47:46.0462170Z echo "Reenabled issues? " 2025-12-04T12:47:46.0467033Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:47:46.0467194Z env: 2025-12-04T12:47:46.0467298Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:46.0467447Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:46.0467699Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:46.0467881Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:46.0468547Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:46.0469063Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:46.0469207Z AWS_REGION: us-east-1 2025-12-04T12:47:46.0469389Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:46.0469609Z AWS_SECRET_ACCESS_KEY: *** 
2025-12-04T12:47:46.0471719Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:46.0471834Z ##[endgroup] 2025-12-04T12:47:46.0494624Z Filtered matrix: 2025-12-04T12:47:46.0495816Z {include: [{config: default, shard: 1, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, unstable: unstable}, {config: default, shard: 2, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, unstable: unstable}, {config: default, shard: 3, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, unstable: unstable}, {config: default, shard: 4, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, unstable: unstable}, {config: default, shard: 5, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, unstable: unstable}, {config: default, shard: 6, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, unstable: unstable}, {config: distributed, shard: 1, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, unstable: unstable}, {config: distributed, shard: 2, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, unstable: unstable}, {config: distributed, shard: 3, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, unstable: unstable}]} 2025-12-04T12:47:46.0497162Z 2025-12-04T12:47:46.0497228Z Is the current job unstable? True 2025-12-04T12:47:46.0497310Z 2025-12-04T12:47:46.0497364Z Is keep-going label set? True 2025-12-04T12:47:46.0497438Z 2025-12-04T12:47:46.0497534Z Reenabled issues? 
2025-12-04T12:47:46.0519560Z ##[group]Run echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T12:47:46.0519775Z echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T12:47:46.0522361Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:47:46.0522511Z env: 2025-12-04T12:47:46.0522611Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:46.0522750Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:46.0522937Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:46.0523108Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:46.0523618Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:46.0524148Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:46.0524273Z AWS_REGION: us-east-1 2025-12-04T12:47:46.0524419Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:46.0524581Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:46.0526682Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:46.0526795Z JOB_TIMEOUT: 300 2025-12-04T12:47:46.0526902Z ##[endgroup] 2025-12-04T12:47:46.0549829Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:47:46.0550039Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:47:46.0550222Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T12:47:46.0552686Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T12:47:46.0552836Z env: 2025-12-04T12:47:46.0552935Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:46.0553078Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:46.0553261Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:46.0553432Z RUNNER_DOCS_DIR: 
/home/runner/_work/_temp/docs 2025-12-04T12:47:46.0553955Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:46.0554451Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:46.0554578Z AWS_REGION: us-east-1 2025-12-04T12:47:46.0554717Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:46.0554876Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:46.0556986Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:46.0557097Z ##[endgroup] 2025-12-04T12:47:46.0653170Z ##[group]Run set -x 2025-12-04T12:47:46.0653331Z set -x 2025-12-04T12:47:46.0653429Z  2025-12-04T12:47:46.0653544Z if [[ $TEST_CONFIG == 'multigpu' ]]; then 2025-12-04T12:47:46.0653705Z  TEST_COMMAND=.ci/pytorch/multigpu-test.sh 2025-12-04T12:47:46.0653861Z elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then 2025-12-04T12:47:46.0654006Z  TEST_COMMAND=.ci/caffe2/test.sh 2025-12-04T12:47:46.0654126Z else 2025-12-04T12:47:46.0654231Z  TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T12:47:46.0654350Z fi 2025-12-04T12:47:46.0654437Z  2025-12-04T12:47:46.0654574Z # detached container should get cleaned up by teardown_ec2_linux 2025-12-04T12:47:46.0654927Z # TODO: Stop building test binaries as part of the build phase 2025-12-04T12:47:46.0655106Z # Used for GPU_FLAG since that doesn't play nice 2025-12-04T12:47:46.0655333Z # shellcheck disable=SC2086,SC2090 2025-12-04T12:47:46.0655468Z container_name=$(docker run \ 2025-12-04T12:47:46.0655598Z  ${GPU_FLAG:-} \ 2025-12-04T12:47:46.0655721Z  -e BUILD_ENVIRONMENT \ 2025-12-04T12:47:46.0655847Z  -e PR_NUMBER \ 2025-12-04T12:47:46.0655963Z  -e GITHUB_ACTIONS \ 2025-12-04T12:47:46.0656081Z  -e GITHUB_REPOSITORY \ 2025-12-04T12:47:46.0656201Z  -e GITHUB_WORKFLOW \ 2025-12-04T12:47:46.0656319Z  -e GITHUB_JOB \ 
2025-12-04T12:47:46.0656433Z  -e GITHUB_RUN_ID \ 2025-12-04T12:47:46.0656553Z  -e GITHUB_RUN_NUMBER \ 2025-12-04T12:47:46.0656670Z  -e GITHUB_RUN_ATTEMPT \ 2025-12-04T12:47:46.0656788Z  -e JOB_ID \ 2025-12-04T12:47:46.0656899Z  -e JOB_NAME \ 2025-12-04T12:47:46.0657006Z  -e BASE_SHA \ 2025-12-04T12:47:46.0657113Z  -e BRANCH \ 2025-12-04T12:47:46.0657218Z  -e SHA1 \ 2025-12-04T12:47:46.0657325Z  -e AWS_DEFAULT_REGION \ 2025-12-04T12:47:46.0657444Z  -e IN_WHEEL_TEST \ 2025-12-04T12:47:46.0657607Z  -e SHARD_NUMBER \ 2025-12-04T12:47:46.0657723Z  -e TEST_CONFIG \ 2025-12-04T12:47:46.0657834Z  -e NUM_TEST_SHARDS \ 2025-12-04T12:47:46.0657957Z  -e REENABLED_ISSUES \ 2025-12-04T12:47:46.0658084Z  -e CONTINUE_THROUGH_ERROR \ 2025-12-04T12:47:46.0658213Z  -e VERBOSE_TEST_LOGS \ 2025-12-04T12:47:46.0658336Z  -e TEST_SHOWLOCALS \ 2025-12-04T12:47:46.0658451Z  -e NO_TEST_TIMEOUT \ 2025-12-04T12:47:46.0658560Z  -e NO_TD \ 2025-12-04T12:47:46.0658676Z  -e MAX_JOBS="$(nproc --ignore=2)" \ 2025-12-04T12:47:46.0658825Z  -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \ 2025-12-04T12:47:46.0658970Z  -e PYTORCH_TEST_RERUN_DISABLED_TESTS \ 2025-12-04T12:47:46.0659106Z  -e TESTS_TO_INCLUDE \ 2025-12-04T12:47:46.0659225Z  -e HUGGING_FACE_HUB_TOKEN \ 2025-12-04T12:47:46.0659348Z  -e DASHBOARD_TAG \ 2025-12-04T12:47:46.0659492Z  --env-file="${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" \ 2025-12-04T12:47:46.0659650Z  --ulimit stack=10485760:83886080 \ 2025-12-04T12:47:46.0659776Z  --ulimit core=0 \ 2025-12-04T12:47:46.0659910Z  --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ 2025-12-04T12:47:46.0660063Z  --security-opt seccomp=unconfined \ 2025-12-04T12:47:46.0660200Z  --cap-add=SYS_PTRACE \ 2025-12-04T12:47:46.0660322Z  --shm-size="8g" \ 2025-12-04T12:47:46.0660428Z  --tty \ 2025-12-04T12:47:46.0660525Z  --detach \ 2025-12-04T12:47:46.0660633Z  --name="${container_name}" \ 2025-12-04T12:47:46.0660763Z  --user jenkins \ 2025-12-04T12:47:46.0660906Z  -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ 
2025-12-04T12:47:46.0661058Z  -w /var/lib/jenkins/workspace \ 2025-12-04T12:47:46.0661245Z  "${DOCKER_IMAGE}" 2025-12-04T12:47:46.0661349Z ) 2025-12-04T12:47:46.0661458Z # save container name for later step 2025-12-04T12:47:46.0661626Z echo "CONTAINER_NAME=${container_name}" >> "$GITHUB_ENV" 2025-12-04T12:47:46.0661894Z # jenkins user does not have write permission to mounted workspace; work-around by copying within container to jenkins home 2025-12-04T12:47:46.0662240Z docker exec -t "${container_name}" sh -c "cd .. && cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && ${TEST_COMMAND}" 2025-12-04T12:47:46.0666859Z shell: /usr/bin/bash -e {0} 2025-12-04T12:47:46.0666973Z env: 2025-12-04T12:47:46.0667069Z GIT_DEFAULT_BRANCH: main 2025-12-04T12:47:46.0667247Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T12:47:46.0667428Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T12:47:46.0667637Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T12:47:46.0668145Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T12:47:46.0668636Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T12:47:46.0668749Z AWS_REGION: us-east-1 2025-12-04T12:47:46.0668923Z AWS_ACCESS_KEY_ID: *** 2025-12-04T12:47:46.0669076Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T12:47:46.0671188Z AWS_SESSION_TOKEN: *** 2025-12-04T12:47:46.0671311Z BUILD_ENVIRONMENT: linux-jammy-rocm-py3.10 2025-12-04T12:47:46.0671438Z PR_NUMBER: 2025-12-04T12:47:46.0671538Z GITHUB_REPOSITORY: pytorch/pytorch 2025-12-04T12:47:46.0671674Z GITHUB_WORKFLOW: trunk-rocm-mi300 2025-12-04T12:47:46.0671793Z GITHUB_JOB: test 2025-12-04T12:47:46.0671894Z GITHUB_RUN_ID: 19921726347 2025-12-04T12:47:46.0672005Z 
GITHUB_RUN_NUMBER: 688 2025-12-04T12:47:46.0672111Z GITHUB_RUN_ATTEMPT: 1 2025-12-04T12:47:46.0672212Z JOB_ID: 57113808223 2025-12-04T12:47:46.0672394Z JOB_NAME: linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4.b, unstable) 2025-12-04T12:47:46.0672587Z BRANCH: main 2025-12-04T12:47:46.0672697Z SHA1: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:47:46.0672854Z BASE_SHA: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:47:46.0672996Z TEST_CONFIG: distributed 2025-12-04T12:47:46.0673104Z SHARD_NUMBER: 3 2025-12-04T12:47:46.0673203Z NUM_TEST_SHARDS: 3 2025-12-04T12:47:46.0673300Z REENABLED_ISSUES: 2025-12-04T12:47:46.0673405Z CONTINUE_THROUGH_ERROR: True 2025-12-04T12:47:46.0673518Z VERBOSE_TEST_LOGS: False 2025-12-04T12:47:46.0673628Z TEST_SHOWLOCALS: False 2025-12-04T12:47:46.0673731Z NO_TEST_TIMEOUT: False 2025-12-04T12:47:46.0673835Z NO_TD: False 2025-12-04T12:47:46.0674105Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:47:46.0674403Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: 0 2025-12-04T12:47:46.0674536Z PYTORCH_TEST_RERUN_DISABLED_TESTS: 0 2025-12-04T12:47:46.0674662Z TESTS_TO_INCLUDE: 2025-12-04T12:47:46.0674765Z DASHBOARD_TAG: 2025-12-04T12:47:46.0674914Z HUGGING_FACE_HUB_TOKEN: *** 2025-12-04T12:47:46.0675033Z ##[endgroup] 2025-12-04T12:47:46.0691891Z + [[ distributed == \m\u\l\t\i\g\p\u ]] 2025-12-04T12:47:46.0692047Z + [[ linux-jammy-rocm-py3.10 == *onnx* ]] 2025-12-04T12:47:46.0692197Z + TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T12:47:46.0698142Z +++ nproc --ignore=2 2025-12-04T12:47:46.0704891Z ++ docker run --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined 
--network=host -e BUILD_ENVIRONMENT -e PR_NUMBER -e GITHUB_ACTIONS -e GITHUB_REPOSITORY -e GITHUB_WORKFLOW -e GITHUB_JOB -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RUN_ATTEMPT -e JOB_ID -e JOB_NAME -e BASE_SHA -e BRANCH -e SHA1 -e AWS_DEFAULT_REGION -e IN_WHEEL_TEST -e SHARD_NUMBER -e TEST_CONFIG -e NUM_TEST_SHARDS -e REENABLED_ISSUES -e CONTINUE_THROUGH_ERROR -e VERBOSE_TEST_LOGS -e TEST_SHOWLOCALS -e NO_TEST_TIMEOUT -e NO_TD -e MAX_JOBS=126 -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK -e PYTORCH_TEST_RERUN_DISABLED_TESTS -e TESTS_TO_INCLUDE -e HUGGING_FACE_HUB_TOKEN -e DASHBOARD_TAG --env-file=/home/runner/_work/_temp/github_env_19921726347 --ulimit stack=10485760:83886080 --ulimit core=0 --env-file=/tmp/github_env_19921726347 --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --shm-size=8g --tty --detach --name= --user jenkins -v /home/runner/_work/pytorch/pytorch:/var/lib/jenkins/workspace -w /var/lib/jenkins/workspace 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T12:47:46.2344543Z + container_name=d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T12:47:46.2344891Z + echo CONTAINER_NAME=d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T12:47:46.2345962Z + docker exec -t d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 sh -c 'cd .. 
&& cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && .ci/pytorch/test.sh' 2025-12-04T12:47:49.2048986Z Processing ./dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T12:47:49.7415649Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (3.18.0) 2025-12-04T12:47:49.7417758Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (4.12.2) 2025-12-04T12:47:49.7419296Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (1.13.3) 2025-12-04T12:47:49.7420026Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (2.8.8) 2025-12-04T12:47:49.7420860Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (3.1.6) 2025-12-04T12:47:49.7422686Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (2025.10.0) 2025-12-04T12:47:49.7585916Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch==2.10.0a0+gitffd9b0f) (1.3.0) 2025-12-04T12:47:49.7611242Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch==2.10.0a0+gitffd9b0f) (3.0.3) 2025-12-04T12:47:49.9427333Z Installing collected packages: torch 2025-12-04T12:47:55.1938959Z Successfully installed torch-2.10.0a0+gitffd9b0f 2025-12-04T12:47:55.2325164Z + export TERM=vt100 2025-12-04T12:47:55.2325477Z + TERM=vt100 2025-12-04T12:47:55.2331630Z ++ dirname .ci/pytorch/test.sh 2025-12-04T12:47:55.2342055Z + source .ci/pytorch/common.sh 2025-12-04T12:47:55.2346266Z +++ 
dirname .ci/pytorch/common.sh 2025-12-04T12:47:55.2355734Z ++ source .ci/pytorch/common_utils.sh 2025-12-04T12:47:55.2357377Z +++ declare -f -t trap_add 2025-12-04T12:47:55.2363289Z ++ set -ex -o pipefail 2025-12-04T12:47:55.2363541Z ++ [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T12:47:55.2363799Z ++ unset HIP_PLATFORM 2025-12-04T12:47:55.2364015Z ++ export PYTORCH_TEST_WITH_ROCM=1 2025-12-04T12:47:55.2364254Z ++ PYTORCH_TEST_WITH_ROCM=1 2025-12-04T12:47:55.2364482Z ++ BUILD_TEST_LIBTORCH=0 2025-12-04T12:47:55.2368179Z ++ dirname .ci/pytorch/test.sh 2025-12-04T12:47:55.2373332Z + source .ci/pytorch/common-build.sh 2025-12-04T12:47:55.2375018Z ++ [[ linux-jammy-rocm-py3.10 != *win-* ]] 2025-12-04T12:47:55.2384085Z ++++ dirname .ci/pytorch/common-build.sh 2025-12-04T12:47:55.2394637Z +++ cd .ci/pytorch 2025-12-04T12:47:55.2395178Z +++ pwd -P 2025-12-04T12:47:55.2397968Z ++ script_dir=/var/lib/jenkins/pytorch/.ci/pytorch 2025-12-04T12:47:55.2398399Z ++ [[ linux-jammy-rocm-py3.10 == *-pch* ]] 2025-12-04T12:47:55.2398702Z ++ which sccache 2025-12-04T12:47:55.2413844Z ++ [[ -z '' ]] 2025-12-04T12:47:55.2414040Z ++ unset SCCACHE_BUCKET 2025-12-04T12:47:55.2414255Z ++ unset SCCACHE_REGION 2025-12-04T12:47:55.2414465Z ++ sccache --stop-server 2025-12-04T12:47:55.2433890Z ++ true 2025-12-04T12:47:55.2434080Z ++ rm -f /var/lib/jenkins/sccache_error.log 2025-12-04T12:47:55.2442996Z ++ trap_add sccache_epilogue EXIT 2025-12-04T12:47:55.2443219Z ++ trap_add_cmd=sccache_epilogue 2025-12-04T12:47:55.2443702Z ++ shift 2025-12-04T12:47:55.2443864Z ++ for trap_add_name in "$@" 2025-12-04T12:47:55.2449669Z ++++ trap -p EXIT 2025-12-04T12:47:55.2451590Z +++ eval 'extract_trap_cmd ' 2025-12-04T12:47:55.2451779Z ++++ extract_trap_cmd 2025-12-04T12:47:55.2451956Z ++++ printf '%s\n' '' 2025-12-04T12:47:55.2452140Z +++ printf '%s\n' sccache_epilogue 2025-12-04T12:47:55.2454314Z ++ trap -- ' 2025-12-04T12:47:55.2454482Z sccache_epilogue' EXIT 2025-12-04T12:47:55.2454762Z ++ [[ -n '' 
]] 2025-12-04T12:47:55.2454943Z ++ [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T12:47:55.2455202Z ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T12:47:55.2455446Z ++ SCCACHE_IDLE_TIMEOUT=0 2025-12-04T12:47:55.2455642Z ++ sccache --start-server 2025-12-04T12:47:55.2470462Z sccache: Starting the server... 2025-12-04T12:47:55.2700896Z sccache: Listening on address 127.0.0.1:4226 2025-12-04T12:47:55.2710163Z ++ sccache --zero-stats 2025-12-04T12:47:55.2723578Z Statistics zeroed. 2025-12-04T12:47:55.2726858Z ++ which ccache 2025-12-04T12:47:55.2734357Z + [[ linux-jammy-rocm-py3.10 != *rocm* ]] 2025-12-04T12:47:55.2734497Z + [[ linux-jammy-rocm-py3.10 == *cuda* ]] 2025-12-04T12:47:55.2734633Z + echo 'Environment variables:' 2025-12-04T12:47:55.2734759Z Environment variables: 2025-12-04T12:47:55.2734866Z + env 2025-12-04T12:47:55.2740497Z GITHUB_WORKSPACE=/home/runner/_work/pytorch/pytorch 2025-12-04T12:47:55.2740659Z CONTINUE_THROUGH_ERROR=True 2025-12-04T12:47:55.2740789Z BUILD_ENVIRONMENT=linux-jammy-rocm-py3.10 2025-12-04T12:47:55.2740958Z HOSTNAME=linux.rocm.gpu.gfx942.4.b-bphpw-runner-qmdl8 2025-12-04T12:47:55.2741204Z GITHUB_PATH=/home/runner/_work/_temp/_runner_file_commands/add_path_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2741414Z GITHUB_ACTION=__run_2 2025-12-04T12:47:55.2741530Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T12:47:55.2741656Z GITHUB_RUN_NUMBER=688 2025-12-04T12:47:55.2741766Z TEST_CONFIG=distributed 2025-12-04T12:47:55.2741913Z RUNNER_NAME=linux.rocm.gpu.gfx942.4.b-bphpw-runner-qmdl8 2025-12-04T12:47:55.2742077Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T12:47:55.2742208Z AWS_DEFAULT_REGION=us-east-1 2025-12-04T12:47:55.2742357Z RUNNER_ARTIFACT_DIR=/home/runner/_work/_temp/artifacts 2025-12-04T12:47:55.2742513Z GITHUB_TRIGGERING_ACTOR=pytorchmergebot 2025-12-04T12:47:55.2742642Z GITHUB_REF_TYPE=branch 2025-12-04T12:47:55.2742767Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 
2025-12-04T12:47:55.2743060Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T12:47:55.2743458Z *** 2025-12-04T12:47:55.2743556Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T12:47:55.2743673Z GITHUB_ACTIONS=true 2025-12-04T12:47:55.2743795Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:47:55.2743947Z GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:47:55.2744171Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/trunk-rocm-mi300.yml@refs/heads/main 2025-12-04T12:47:55.2744370Z UCC_HOME=/usr 2025-12-04T12:47:55.2744476Z RUNNER_ENVIRONMENT=self-hosted 2025-12-04T12:47:55.2744599Z VERBOSE_TEST_LOGS=False 2025-12-04T12:47:55.2744709Z GITHUB_REF=refs/heads/main 2025-12-04T12:47:55.2744820Z RUNNER_OS=Linux 2025-12-04T12:47:55.2744932Z SHARD_NUMBER=3 2025-12-04T12:47:55.2745037Z GITHUB_REF_PROTECTED=true 2025-12-04T12:47:55.2745235Z RUNNER_MANUALLY_TRAP_SIG=1 2025-12-04T12:47:55.2745347Z HOME=/var/lib/jenkins 2025-12-04T12:47:55.2745473Z GITHUB_API_URL=https://api.github.com 2025-12-04T12:47:55.2745614Z PYTORCH_TEST_RERUN_DISABLED_TESTS=0 2025-12-04T12:47:55.2745752Z RUNNER_DOCS_DIR=/home/runner/_work/_temp/docs 2025-12-04T12:47:55.2745884Z LANG=C.UTF-8 2025-12-04T12:47:55.2746005Z UCX_COMMIT=29831d319e6be55cb8c768ca61de335c934ca39e 2025-12-04T12:47:55.2746148Z PYTORCH_TEST_WITH_ROCM=1 2025-12-04T12:47:55.2746294Z RUNNER_TRACKING_ID=github_abdf52a5-e903-48b0-b8ad-c48a5c6d512a 2025-12-04T12:47:55.2746452Z RUNNER_ARCH=X64 2025-12-04T12:47:55.2746561Z RUNNER_TEMP=/home/runner/_work/_temp 2025-12-04T12:47:55.2746683Z NUM_TEST_SHARDS=3 2025-12-04T12:47:55.2746781Z UCX_HOME=/usr 2025-12-04T12:47:55.2747017Z GITHUB_STATE=/home/runner/_work/_temp/_runner_file_commands/save_state_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2747314Z JOB_NAME=linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4.b, unstable) 2025-12-04T12:47:55.2747581Z MAGMA_HOME=/opt/rocm/magma 2025-12-04T12:47:55.2747776Z 
GITHUB_ENV=/home/runner/_work/_temp/_runner_file_commands/set_env_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2748023Z GITHUB_EVENT_PATH=/home/runner/_work/_temp/_github_workflow/event.json 2025-12-04T12:47:55.2748189Z GITHUB_EVENT_NAME=push 2025-12-04T12:47:55.2748350Z GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT=actions-runner-controller/0.12.1 2025-12-04T12:47:55.2748521Z DASHBOARD_TAG= 2025-12-04T12:47:55.2748622Z GITHUB_RUN_ID=19921726347 2025-12-04T12:47:55.2748834Z GITHUB_STEP_SUMMARY=/home/runner/_work/_temp/_runner_file_commands/step_summary_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2749062Z GITHUB_ACTOR=pytorchmergebot 2025-12-04T12:47:55.2749183Z PR_NUMBER= 2025-12-04T12:47:55.2749281Z GITHUB_RUN_ATTEMPT=1 2025-12-04T12:47:55.2749394Z ANACONDA_PYTHON_VERSION=3.10 2025-12-04T12:47:55.2749550Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T12:47:55.2749687Z TERM=vt100 2025-12-04T12:47:55.2749787Z INSTALLED_VISION=yes 2025-12-04T12:47:55.2749893Z BRANCH=main 2025-12-04T12:47:55.2749994Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T12:47:55.2750110Z TESTS_TO_INCLUDE= 2025-12-04T12:47:55.2750274Z GITHUB_ACTION_PATH=/home/runner/_work/pytorch/pytorch/./.github/actions/setup-rocm 2025-12-04T12:47:55.2750465Z GITHUB_SERVER_URL=https://github.com 2025-12-04T12:47:55.2750608Z PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950;gfx1100 2025-12-04T12:47:55.2750763Z UCC_COMMIT=9f4b242cbbd8b1462cbc732eb29316cdfa124b77 2025-12-04T12:47:55.2750903Z REENABLED_ISSUES= 2025-12-04T12:47:55.2751002Z SHLVL=1 2025-12-04T12:47:55.2751095Z MAX_JOBS=126 2025-12-04T12:47:55.2751229Z RUNNER_TEST_RESULTS_DIR=/home/runner/_work/_temp/test-results 2025-12-04T12:47:55.2751393Z GITHUB_ACTOR_ID=97764156 2025-12-04T12:47:55.2751514Z RUNNER_TOOL_CACHE=/home/runner/_work/_tool 2025-12-04T12:47:55.2751680Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:47:55.2751836Z GITHUB_REF_NAME=main 2025-12-04T12:47:55.2751947Z ROCM_PATH=/opt/rocm 
2025-12-04T12:47:55.2752049Z GITHUB_JOB=test 2025-12-04T12:47:55.2752150Z NO_TEST_TIMEOUT=False 2025-12-04T12:47:55.2752267Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T12:47:55.2752384Z LC_ALL=C.UTF-8 2025-12-04T12:47:55.2752485Z GITHUB_RETENTION_DAYS=90 2025-12-04T12:47:55.2752609Z RUNNER_WORKSPACE=/home/runner/_work/pytorch 2025-12-04T12:47:55.2752743Z OPENSSL_DIR=/opt/openssl 2025-12-04T12:47:55.2752859Z GITHUB_ACTION_REPOSITORY= 2025-12-04T12:47:55.2753219Z PATH=/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T12:47:55.2753575Z GITHUB_BASE_REF= 2025-12-04T12:47:55.2753678Z CI=true 2025-12-04T12:47:55.2753777Z GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T12:47:55.2753896Z JOB_ID=57113808223 2025-12-04T12:47:55.2753996Z GITHUB_HEAD_REF= 2025-12-04T12:47:55.2754097Z GITHUB_ACTION_REF= 2025-12-04T12:47:55.2754251Z TEST_SHOWLOCALS=False 2025-12-04T12:47:55.2754370Z GITHUB_WORKFLOW=trunk-rocm-mi300 2025-12-04T12:47:55.2754499Z DEBIAN_FRONTEND=noninteractive 2025-12-04T12:47:55.2754713Z GITHUB_OUTPUT=/home/runner/_work/_temp/_runner_file_commands/set_output_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2754928Z NO_TD=False 2025-12-04T12:47:55.2755026Z OLDPWD=/var/lib/jenkins 2025-12-04T12:47:55.2755138Z _=/usr/bin/env 2025-12-04T12:47:55.2755272Z ++ python -c 'import site; print(site.getsitepackages()[0])' 2025-12-04T12:47:55.2803990Z + TORCH_INSTALL_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch 2025-12-04T12:47:55.2804223Z + TORCH_BIN_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T12:47:55.2804504Z + TORCH_LIB_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib 2025-12-04T12:47:55.2804725Z + TORCH_TEST_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/test 2025-12-04T12:47:55.2804899Z + BUILD_DIR=build 
2025-12-04T12:47:55.2805012Z + BUILD_RENAMED_DIR=build_renamed 2025-12-04T12:47:55.2805134Z + BUILD_BIN_DIR=build/bin 2025-12-04T12:47:55.2805250Z + SHARD_NUMBER=3 2025-12-04T12:47:55.2805349Z + NUM_TEST_SHARDS=3 2025-12-04T12:47:55.2805461Z + export TORCH_SERIALIZATION_DEBUG=1 2025-12-04T12:47:55.2805590Z + TORCH_SERIALIZATION_DEBUG=1 2025-12-04T12:47:55.2805708Z + export VALGRIND=ON 2025-12-04T12:47:55.2805812Z + VALGRIND=ON 2025-12-04T12:47:55.2805925Z + [[ linux-jammy-rocm-py3.10 == *clang9* ]] 2025-12-04T12:47:55.2806065Z + [[ linux-jammy-rocm-py3.10 == *xpu* ]] 2025-12-04T12:47:55.2806195Z + detect_cuda_arch 2025-12-04T12:47:55.2806309Z + [[ linux-jammy-rocm-py3.10 == *cuda* ]] 2025-12-04T12:47:55.2806449Z + [[ linux-jammy-rocm-py3.10 == *s390x* ]] 2025-12-04T12:47:55.2806581Z + [[ 0 == \1 ]] 2025-12-04T12:47:55.2806679Z + [[ True == \1 ]] 2025-12-04T12:47:55.2806792Z + [[ linux-jammy-rocm-py3.10 != *bazel* ]] 2025-12-04T12:47:55.2807902Z ++ realpath build/custom_test_artifacts 2025-12-04T12:47:55.2813661Z + CUSTOM_TEST_ARTIFACT_BUILD_DIR=/var/lib/jenkins/pytorch/build/custom_test_artifacts 2025-12-04T12:47:55.2813909Z + [[ -n '' ]] 2025-12-04T12:47:55.2814041Z + echo 'Environment variables' 2025-12-04T12:47:55.2814187Z Environment variables 2025-12-04T12:47:55.2814303Z + env 2025-12-04T12:47:55.2819201Z GITHUB_WORKSPACE=/home/runner/_work/pytorch/pytorch 2025-12-04T12:47:55.2819416Z CONTINUE_THROUGH_ERROR=True 2025-12-04T12:47:55.2819564Z BUILD_ENVIRONMENT=linux-jammy-rocm-py3.10 2025-12-04T12:47:55.2819748Z HOSTNAME=linux.rocm.gpu.gfx942.4.b-bphpw-runner-qmdl8 2025-12-04T12:47:55.2820016Z GITHUB_PATH=/home/runner/_work/_temp/_runner_file_commands/add_path_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2820245Z GITHUB_ACTION=__run_2 2025-12-04T12:47:55.2820370Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T12:47:55.2820516Z GITHUB_RUN_NUMBER=688 2025-12-04T12:47:55.2820638Z TEST_CONFIG=distributed 2025-12-04T12:47:55.2820795Z 
RUNNER_NAME=linux.rocm.gpu.gfx942.4.b-bphpw-runner-qmdl8 2025-12-04T12:47:55.2820970Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T12:47:55.2821126Z AWS_DEFAULT_REGION=us-east-1 2025-12-04T12:47:55.2821282Z RUNNER_ARTIFACT_DIR=/home/runner/_work/_temp/artifacts 2025-12-04T12:47:55.2821452Z GITHUB_TRIGGERING_ACTOR=pytorchmergebot 2025-12-04T12:47:55.2821594Z GITHUB_REF_TYPE=branch 2025-12-04T12:47:55.2821734Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:47:55.2822056Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T12:47:55.2822253Z *** 2025-12-04T12:47:55.2822358Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T12:47:55.2822491Z GITHUB_ACTIONS=true 2025-12-04T12:47:55.2822616Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:47:55.2822785Z GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:47:55.2823026Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/trunk-rocm-mi300.yml@refs/heads/main 2025-12-04T12:47:55.2823244Z UCC_HOME=/usr 2025-12-04T12:47:55.2823354Z TORCH_SERIALIZATION_DEBUG=1 2025-12-04T12:47:55.2823481Z RUNNER_ENVIRONMENT=self-hosted 2025-12-04T12:47:55.2824110Z VERBOSE_TEST_LOGS=False 2025-12-04T12:47:55.2824237Z GITHUB_REF=refs/heads/main 2025-12-04T12:47:55.2824356Z RUNNER_OS=Linux 2025-12-04T12:47:55.2824468Z SHARD_NUMBER=3 2025-12-04T12:47:55.2824583Z GITHUB_REF_PROTECTED=true 2025-12-04T12:47:55.2824700Z RUNNER_MANUALLY_TRAP_SIG=1 2025-12-04T12:47:55.2824818Z HOME=/var/lib/jenkins 2025-12-04T12:47:55.2824957Z GITHUB_API_URL=https://api.github.com 2025-12-04T12:47:55.2825106Z PYTORCH_TEST_RERUN_DISABLED_TESTS=0 2025-12-04T12:47:55.2825260Z RUNNER_DOCS_DIR=/home/runner/_work/_temp/docs 2025-12-04T12:47:55.2825399Z LANG=C.UTF-8 2025-12-04T12:47:55.2825527Z UCX_COMMIT=29831d319e6be55cb8c768ca61de335c934ca39e 2025-12-04T12:47:55.2825681Z PYTORCH_TEST_WITH_ROCM=1 2025-12-04T12:47:55.2825842Z RUNNER_TRACKING_ID=github_abdf52a5-e903-48b0-b8ad-c48a5c6d512a 2025-12-04T12:47:55.2826169Z RUNNER_ARCH=X64 
2025-12-04T12:47:55.2826284Z RUNNER_TEMP=/home/runner/_work/_temp 2025-12-04T12:47:55.2826412Z NUM_TEST_SHARDS=3 2025-12-04T12:47:55.2826522Z UCX_HOME=/usr 2025-12-04T12:47:55.2826736Z GITHUB_STATE=/home/runner/_work/_temp/_runner_file_commands/save_state_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2827051Z JOB_NAME=linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4.b, unstable) 2025-12-04T12:47:55.2827272Z MAGMA_HOME=/opt/rocm/magma 2025-12-04T12:47:55.2827543Z GITHUB_ENV=/home/runner/_work/_temp/_runner_file_commands/set_env_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2827814Z GITHUB_EVENT_PATH=/home/runner/_work/_temp/_github_workflow/event.json 2025-12-04T12:47:55.2827981Z GITHUB_EVENT_NAME=push 2025-12-04T12:47:55.2828141Z GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT=actions-runner-controller/0.12.1 2025-12-04T12:47:55.2828310Z DASHBOARD_TAG= 2025-12-04T12:47:55.2828417Z GITHUB_RUN_ID=19921726347 2025-12-04T12:47:55.2828635Z GITHUB_STEP_SUMMARY=/home/runner/_work/_temp/_runner_file_commands/step_summary_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2828863Z GITHUB_ACTOR=pytorchmergebot 2025-12-04T12:47:55.2828981Z PR_NUMBER= 2025-12-04T12:47:55.2829082Z GITHUB_RUN_ATTEMPT=1 2025-12-04T12:47:55.2829188Z VALGRIND=ON 2025-12-04T12:47:55.2829290Z ANACONDA_PYTHON_VERSION=3.10 2025-12-04T12:47:55.2829432Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T12:47:55.2829571Z TERM=vt100 2025-12-04T12:47:55.2829670Z INSTALLED_VISION=yes 2025-12-04T12:47:55.2829776Z BRANCH=main 2025-12-04T12:47:55.2829877Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T12:47:55.2829994Z TESTS_TO_INCLUDE= 2025-12-04T12:47:55.2830158Z GITHUB_ACTION_PATH=/home/runner/_work/pytorch/pytorch/./.github/actions/setup-rocm 2025-12-04T12:47:55.2830350Z GITHUB_SERVER_URL=https://github.com 2025-12-04T12:47:55.2830493Z PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950;gfx1100 2025-12-04T12:47:55.2830643Z 
UCC_COMMIT=9f4b242cbbd8b1462cbc732eb29316cdfa124b77 2025-12-04T12:47:55.2830789Z REENABLED_ISSUES= 2025-12-04T12:47:55.2830887Z SHLVL=1 2025-12-04T12:47:55.2830982Z MAX_JOBS=126 2025-12-04T12:47:55.2831135Z RUNNER_TEST_RESULTS_DIR=/home/runner/_work/_temp/test-results 2025-12-04T12:47:55.2831296Z GITHUB_ACTOR_ID=97764156 2025-12-04T12:47:55.2831419Z RUNNER_TOOL_CACHE=/home/runner/_work/_tool 2025-12-04T12:47:55.2831586Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T12:47:55.2831741Z GITHUB_REF_NAME=main 2025-12-04T12:47:55.2831849Z ROCM_PATH=/opt/rocm 2025-12-04T12:47:55.2831956Z GITHUB_JOB=test 2025-12-04T12:47:55.2832060Z NO_TEST_TIMEOUT=False 2025-12-04T12:47:55.2832177Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T12:47:55.2832300Z LC_ALL=C.UTF-8 2025-12-04T12:47:55.2832402Z GITHUB_RETENTION_DAYS=90 2025-12-04T12:47:55.2832525Z RUNNER_WORKSPACE=/home/runner/_work/pytorch 2025-12-04T12:47:55.2832658Z OPENSSL_DIR=/opt/openssl 2025-12-04T12:47:55.2832773Z GITHUB_ACTION_REPOSITORY= 2025-12-04T12:47:55.2833129Z PATH=/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T12:47:55.2833527Z GITHUB_BASE_REF= 2025-12-04T12:47:55.2833627Z CI=true 2025-12-04T12:47:55.2833726Z GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T12:47:55.2833846Z JOB_ID=57113808223 2025-12-04T12:47:55.2833948Z GITHUB_HEAD_REF= 2025-12-04T12:47:55.2834048Z GITHUB_ACTION_REF= 2025-12-04T12:47:55.2834153Z TEST_SHOWLOCALS=False 2025-12-04T12:47:55.2834270Z GITHUB_WORKFLOW=trunk-rocm-mi300 2025-12-04T12:47:55.2834398Z DEBIAN_FRONTEND=noninteractive 2025-12-04T12:47:55.2834609Z GITHUB_OUTPUT=/home/runner/_work/_temp/_runner_file_commands/set_output_064d0f99-137c-48c0-a281-8df475af8492 2025-12-04T12:47:55.2834820Z NO_TD=False 2025-12-04T12:47:55.2834919Z OLDPWD=/var/lib/jenkins 2025-12-04T12:47:55.2835025Z 
_=/usr/bin/env 2025-12-04T12:47:55.2835142Z + echo 'Testing pytorch' 2025-12-04T12:47:55.2835324Z Testing pytorch 2025-12-04T12:47:55.2835423Z + export LANG=C.UTF-8 2025-12-04T12:47:55.2835528Z + LANG=C.UTF-8 2025-12-04T12:47:55.2835623Z + PR_NUMBER= 2025-12-04T12:47:55.2835728Z + [[ distributed == \d\e\f\a\u\l\t ]] 2025-12-04T12:47:55.2835866Z + [[ distributed == \d\i\s\t\r\i\b\u\t\e\d ]] 2025-12-04T12:47:55.2836007Z + [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T12:47:55.2836143Z + export HIP_VISIBLE_DEVICES=0,1,2,3 2025-12-04T12:47:55.2836271Z + HIP_VISIBLE_DEVICES=0,1,2,3 2025-12-04T12:47:55.2836391Z + [[ distributed == \s\l\o\w ]] 2025-12-04T12:47:55.2836532Z + [[ linux-jammy-rocm-py3.10 == *slow-gradcheck* ]] 2025-12-04T12:47:55.2836684Z + [[ linux-jammy-rocm-py3.10 == *cuda* ]] 2025-12-04T12:47:55.2836820Z + [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T12:47:55.2836961Z + export PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T12:47:55.2837102Z + PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T12:47:55.2837236Z + [[ distributed == *crossref* ]] 2025-12-04T12:47:55.2837370Z + [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T12:47:55.2837540Z + export VALGRIND=OFF 2025-12-04T12:47:55.2837646Z + VALGRIND=OFF 2025-12-04T12:47:55.2837742Z + rocminfo 2025-12-04T12:47:55.2939423Z ROCk module version 6.12.12 is loaded 2025-12-04T12:47:55.3605541Z ===================== 2025-12-04T12:47:55.3605930Z HSA System Attributes 2025-12-04T12:47:55.3606230Z ===================== 2025-12-04T12:47:55.3606522Z Runtime Version: 1.18 2025-12-04T12:47:55.3606834Z Runtime Ext Version: 1.14 2025-12-04T12:47:55.3607163Z System Timestamp Freq.: 1000.000000MHz 2025-12-04T12:47:55.3607772Z Sig. 
Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) 2025-12-04T12:47:55.3608337Z Machine Model: LARGE 2025-12-04T12:47:55.3608799Z System Endianness: LITTLE 2025-12-04T12:47:55.3609209Z Mwaitx: DISABLED 2025-12-04T12:47:55.3609544Z XNACK enabled: NO 2025-12-04T12:47:55.3609862Z DMAbuf Support: YES 2025-12-04T12:47:55.3610165Z VMM Support: YES 2025-12-04T12:47:55.3610355Z 2025-12-04T12:47:55.3610469Z ========== 2025-12-04T12:47:55.3610768Z HSA Agents 2025-12-04T12:47:55.3611055Z ========== 2025-12-04T12:47:55.3611323Z ******* 2025-12-04T12:47:55.3611592Z Agent 1 2025-12-04T12:47:55.3611861Z ******* 2025-12-04T12:47:55.3612201Z Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:47:55.3612625Z Uuid: CPU-XX 2025-12-04T12:47:55.3613248Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:47:55.3613698Z Vendor Name: CPU 2025-12-04T12:47:55.3614126Z Feature: None specified 2025-12-04T12:47:55.3614551Z Profile: FULL_PROFILE 2025-12-04T12:47:55.3614987Z Float Round Mode: NEAR 2025-12-04T12:47:55.3615424Z Max Queue Number: 0(0x0) 2025-12-04T12:47:55.3616042Z Queue Min Size: 0(0x0) 2025-12-04T12:47:55.3616461Z Queue Max Size: 0(0x0) 2025-12-04T12:47:55.3616885Z Queue Type: MULTI 2025-12-04T12:47:55.3617284Z Node: 0 2025-12-04T12:47:55.3617784Z Device Type: CPU 2025-12-04T12:47:55.3618080Z Cache Info: 2025-12-04T12:47:55.3618221Z L1: 49152(0xc000) KB 2025-12-04T12:47:55.3618372Z Chip ID: 0(0x0) 2025-12-04T12:47:55.3618530Z ASIC Revision: 0(0x0) 2025-12-04T12:47:55.3618766Z Cacheline Size: 64(0x40) 2025-12-04T12:47:55.3618932Z Max Clock Freq. (MHz): 3300 2025-12-04T12:47:55.3619086Z BDFID: 0 2025-12-04T12:47:55.3619250Z Internal Node ID: 0 2025-12-04T12:47:55.3619412Z Compute Unit: 64 2025-12-04T12:47:55.3619574Z SIMDs per CU: 0 2025-12-04T12:47:55.3619734Z Shader Engines: 0 2025-12-04T12:47:55.3619901Z Shader Arrs. per Eng.: 0 2025-12-04T12:47:55.3620070Z WatchPts on Addr. 
Ranges:1 2025-12-04T12:47:55.3620220Z Memory Properties: 2025-12-04T12:47:55.3620339Z Features: None 2025-12-04T12:47:55.3620460Z Pool Info: 2025-12-04T12:47:55.3620579Z Pool 1 2025-12-04T12:47:55.3620728Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:47:55.3620893Z Size: 1584734448(0x5e7520f0) KB 2025-12-04T12:47:55.3621059Z Allocatable: TRUE 2025-12-04T12:47:55.3621226Z Alloc Granule: 4KB 2025-12-04T12:47:55.3621400Z Alloc Recommended Granule:4KB 2025-12-04T12:47:55.3621573Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3621745Z Accessible by all: TRUE 2025-12-04T12:47:55.3621891Z Pool 2 2025-12-04T12:47:55.3622033Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:47:55.3622191Z Size: 1584734448(0x5e7520f0) KB 2025-12-04T12:47:55.3622350Z Allocatable: TRUE 2025-12-04T12:47:55.3622523Z Alloc Granule: 4KB 2025-12-04T12:47:55.3622693Z Alloc Recommended Granule:4KB 2025-12-04T12:47:55.3622867Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3623037Z Accessible by all: TRUE 2025-12-04T12:47:55.3623182Z Pool 3 2025-12-04T12:47:55.3623319Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2025-12-04T12:47:55.3623479Z Size: 1584734448(0x5e7520f0) KB 2025-12-04T12:47:55.3623637Z Allocatable: TRUE 2025-12-04T12:47:55.3623802Z Alloc Granule: 4KB 2025-12-04T12:47:55.3623972Z Alloc Recommended Granule:4KB 2025-12-04T12:47:55.3624145Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3624320Z Accessible by all: TRUE 2025-12-04T12:47:55.3624467Z Pool 4 2025-12-04T12:47:55.3624606Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:47:55.3624921Z Size: 1584734448(0x5e7520f0) KB 2025-12-04T12:47:55.3625081Z Allocatable: TRUE 2025-12-04T12:47:55.3625246Z Alloc Granule: 4KB 2025-12-04T12:47:55.3625420Z Alloc Recommended Granule:4KB 2025-12-04T12:47:55.3625595Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3625769Z Accessible by all: TRUE 2025-12-04T12:47:55.3625917Z ISA Info: 2025-12-04T12:47:55.3626031Z ******* 2025-12-04T12:47:55.3626144Z Agent 2 2025-12-04T12:47:55.3626307Z ******* 
2025-12-04T12:47:55.3626435Z Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:47:55.3626592Z Uuid: CPU-XX 2025-12-04T12:47:55.3626759Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:47:55.3626931Z Vendor Name: CPU 2025-12-04T12:47:55.3627090Z Feature: None specified 2025-12-04T12:47:55.3627252Z Profile: FULL_PROFILE 2025-12-04T12:47:55.3627415Z Float Round Mode: NEAR 2025-12-04T12:47:55.3627630Z Max Queue Number: 0(0x0) 2025-12-04T12:47:55.3627792Z Queue Min Size: 0(0x0) 2025-12-04T12:47:55.3627953Z Queue Max Size: 0(0x0) 2025-12-04T12:47:55.3628112Z Queue Type: MULTI 2025-12-04T12:47:55.3628273Z Node: 1 2025-12-04T12:47:55.3628426Z Device Type: CPU 2025-12-04T12:47:55.3628570Z Cache Info: 2025-12-04T12:47:55.3628704Z L1: 49152(0xc000) KB 2025-12-04T12:47:55.3628854Z Chip ID: 0(0x0) 2025-12-04T12:47:55.3629009Z ASIC Revision: 0(0x0) 2025-12-04T12:47:55.3629170Z Cacheline Size: 64(0x40) 2025-12-04T12:47:55.3629332Z Max Clock Freq. (MHz): 3300 2025-12-04T12:47:55.3629488Z BDFID: 0 2025-12-04T12:47:55.3629641Z Internal Node ID: 1 2025-12-04T12:47:55.3629797Z Compute Unit: 64 2025-12-04T12:47:55.3629957Z SIMDs per CU: 0 2025-12-04T12:47:55.3630117Z Shader Engines: 0 2025-12-04T12:47:55.3630281Z Shader Arrs. per Eng.: 0 2025-12-04T12:47:55.3630449Z WatchPts on Addr. 
Ranges:1 2025-12-04T12:47:55.3630598Z Memory Properties: 2025-12-04T12:47:55.3630717Z Features: None 2025-12-04T12:47:55.3630836Z Pool Info: 2025-12-04T12:47:55.3630950Z Pool 1 2025-12-04T12:47:55.3631092Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:47:55.3631289Z Size: 1585355628(0x5e7e9b6c) KB 2025-12-04T12:47:55.3631442Z Allocatable: TRUE 2025-12-04T12:47:55.3631601Z Alloc Granule: 4KB 2025-12-04T12:47:55.3631810Z Alloc Recommended Granule:4KB 2025-12-04T12:47:55.3631985Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3632181Z Accessible by all: TRUE 2025-12-04T12:47:55.3632475Z Pool 2 2025-12-04T12:47:55.3632610Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:47:55.3632765Z Size: 1585355628(0x5e7e9b6c) KB 2025-12-04T12:47:55.3632961Z Allocatable: TRUE 2025-12-04T12:47:55.3633121Z Alloc Granule: 4KB 2025-12-04T12:47:55.3633354Z Alloc Recommended Granule:4KB 2025-12-04T12:47:55.3633520Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3633684Z Accessible by all: TRUE 2025-12-04T12:47:55.3633913Z Pool 3 2025-12-04T12:47:55.3634107Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2025-12-04T12:47:55.3634267Z Size: 1585355628(0x5e7e9b6c) KB 2025-12-04T12:47:55.3634425Z Allocatable: TRUE 2025-12-04T12:47:55.3634591Z Alloc Granule: 4KB 2025-12-04T12:47:55.3634762Z Alloc Recommended Granule:4KB 2025-12-04T12:47:55.3634936Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3635102Z Accessible by all: TRUE 2025-12-04T12:47:55.3635248Z Pool 4 2025-12-04T12:47:55.3635383Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:47:55.3635543Z Size: 1585355628(0x5e7e9b6c) KB 2025-12-04T12:47:55.3635698Z Allocatable: TRUE 2025-12-04T12:47:55.3635867Z Alloc Granule: 4KB 2025-12-04T12:47:55.3636037Z Alloc Recommended Granule:4KB 2025-12-04T12:47:55.3636208Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3636376Z Accessible by all: TRUE 2025-12-04T12:47:55.3636522Z ISA Info: 2025-12-04T12:47:55.3636635Z ******* 2025-12-04T12:47:55.3636744Z Agent 3 2025-12-04T12:47:55.3636852Z ******* 
2025-12-04T12:47:55.3655468Z Name: gfx942 2025-12-04T12:47:55.3655695Z Uuid: GPU-70f7a8c60f3fb761 2025-12-04T12:47:55.3655877Z Marketing Name: 2025-12-04T12:47:55.3656047Z Vendor Name: AMD 2025-12-04T12:47:55.3656215Z Feature: KERNEL_DISPATCH 2025-12-04T12:47:55.3656394Z Profile: BASE_PROFILE 2025-12-04T12:47:55.3656565Z Float Round Mode: NEAR 2025-12-04T12:47:55.3656739Z Max Queue Number: 128(0x80) 2025-12-04T12:47:55.3656907Z Queue Min Size: 64(0x40) 2025-12-04T12:47:55.3657073Z Queue Max Size: 131072(0x20000) 2025-12-04T12:47:55.3657240Z Queue Type: MULTI 2025-12-04T12:47:55.3657394Z Node: 2 2025-12-04T12:47:55.3657600Z Device Type: GPU 2025-12-04T12:47:55.3657748Z Cache Info: 2025-12-04T12:47:55.3657884Z L1: 32(0x20) KB 2025-12-04T12:47:55.3658034Z L2: 4096(0x1000) KB 2025-12-04T12:47:55.3658177Z L3: 262144(0x40000) KB 2025-12-04T12:47:55.3658326Z Chip ID: 29861(0x74a5) 2025-12-04T12:47:55.3658486Z ASIC Revision: 1(0x1) 2025-12-04T12:47:55.3658751Z Cacheline Size: 128(0x80) 2025-12-04T12:47:55.3658917Z Max Clock Freq. (MHz): 2100 2025-12-04T12:47:55.3659083Z BDFID: 29952 2025-12-04T12:47:55.3659242Z Internal Node ID: 2 2025-12-04T12:47:55.3659406Z Compute Unit: 304 2025-12-04T12:47:55.3659566Z SIMDs per CU: 4 2025-12-04T12:47:55.3659728Z Shader Engines: 32 2025-12-04T12:47:55.3659894Z Shader Arrs. per Eng.: 1 2025-12-04T12:47:55.3660106Z WatchPts on Addr. 
Ranges:4 2025-12-04T12:47:55.3660275Z Coherent Host Access: FALSE 2025-12-04T12:47:55.3660430Z Memory Properties: 2025-12-04T12:47:55.3660569Z Features: KERNEL_DISPATCH 2025-12-04T12:47:55.3660726Z Fast F16 Operation: TRUE 2025-12-04T12:47:55.3660902Z Wavefront Size: 64(0x40) 2025-12-04T12:47:55.3661071Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3661227Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3661368Z x 1024(0x400) 2025-12-04T12:47:55.3661507Z y 1024(0x400) 2025-12-04T12:47:55.3661644Z z 1024(0x400) 2025-12-04T12:47:55.3661795Z Max Waves Per CU: 32(0x20) 2025-12-04T12:47:55.3661969Z Max Work-item Per CU: 2048(0x800) 2025-12-04T12:47:55.3662134Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3662283Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3662417Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3662557Z y 65535(0xffff) 2025-12-04T12:47:55.3662696Z z 65535(0xffff) 2025-12-04T12:47:55.3662855Z Max fbarriers/Workgrp: 32 2025-12-04T12:47:55.3663085Z Packet Processor uCode:: 185 2025-12-04T12:47:55.3683835Z SDMA engine uCode:: 24 2025-12-04T12:47:55.3684000Z IOMMU Support:: None 2025-12-04T12:47:55.3684139Z Pool Info: 2025-12-04T12:47:55.3684249Z Pool 1 2025-12-04T12:47:55.3684390Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:47:55.3684544Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3684695Z Allocatable: TRUE 2025-12-04T12:47:55.3684853Z Alloc Granule: 4KB 2025-12-04T12:47:55.3685016Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3685179Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3686954Z Accessible by all: FALSE 2025-12-04T12:47:55.3687095Z Pool 2 2025-12-04T12:47:55.3687227Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:47:55.3687380Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3687554Z Allocatable: TRUE 2025-12-04T12:47:55.3687713Z Alloc Granule: 4KB 2025-12-04T12:47:55.3687874Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3688036Z Alloc Alignment: 4KB 
2025-12-04T12:47:55.3688237Z Accessible by all: FALSE 2025-12-04T12:47:55.3688373Z Pool 3 2025-12-04T12:47:55.3688500Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:47:55.3688645Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3688790Z Allocatable: TRUE 2025-12-04T12:47:55.3688942Z Alloc Granule: 4KB 2025-12-04T12:47:55.3689101Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3689262Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3689419Z Accessible by all: FALSE 2025-12-04T12:47:55.3689591Z Pool 4 2025-12-04T12:47:55.3689714Z Segment: GROUP 2025-12-04T12:47:55.3689861Z Size: 64(0x40) KB 2025-12-04T12:47:55.3690005Z Allocatable: FALSE 2025-12-04T12:47:55.3690158Z Alloc Granule: 0KB 2025-12-04T12:47:55.3690318Z Alloc Recommended Granule:0KB 2025-12-04T12:47:55.3690479Z Alloc Alignment: 0KB 2025-12-04T12:47:55.3690635Z Accessible by all: FALSE 2025-12-04T12:47:55.3690772Z ISA Info: 2025-12-04T12:47:55.3690877Z ISA 1 2025-12-04T12:47:55.3691010Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T12:47:55.3691183Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:47:55.3691343Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:47:55.3691502Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3691667Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3691820Z Fast f16: TRUE 2025-12-04T12:47:55.3691970Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3692115Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3692248Z x 1024(0x400) 2025-12-04T12:47:55.3692389Z y 1024(0x400) 2025-12-04T12:47:55.3692527Z z 1024(0x400) 2025-12-04T12:47:55.3692677Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3692832Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3692961Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3693099Z y 65535(0xffff) 2025-12-04T12:47:55.3693238Z z 65535(0xffff) 2025-12-04T12:47:55.3693387Z FBarrier Max Size: 32 2025-12-04T12:47:55.3693531Z ISA 2 2025-12-04T12:47:55.3693682Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 
2025-12-04T12:47:55.3693868Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:47:55.3694039Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:47:55.3694207Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3694378Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3694541Z Fast f16: TRUE 2025-12-04T12:47:55.3694703Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3694856Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3695022Z x 1024(0x400) 2025-12-04T12:47:55.3695160Z y 1024(0x400) 2025-12-04T12:47:55.3695295Z z 1024(0x400) 2025-12-04T12:47:55.3695445Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3695592Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3695723Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3695862Z y 65535(0xffff) 2025-12-04T12:47:55.3696001Z z 65535(0xffff) 2025-12-04T12:47:55.3696153Z FBarrier Max Size: 32 2025-12-04T12:47:55.3696332Z ******* 2025-12-04T12:47:55.3696442Z Agent 4 2025-12-04T12:47:55.3696548Z ******* 2025-12-04T12:47:55.3696679Z Name: gfx942 2025-12-04T12:47:55.3696832Z Uuid: GPU-e6f4b2936c68b43f 2025-12-04T12:47:55.3696992Z Marketing Name: 2025-12-04T12:47:55.3697154Z Vendor Name: AMD 2025-12-04T12:47:55.3697312Z Feature: KERNEL_DISPATCH 2025-12-04T12:47:55.3697523Z Profile: BASE_PROFILE 2025-12-04T12:47:55.3697686Z Float Round Mode: NEAR 2025-12-04T12:47:55.3697845Z Max Queue Number: 128(0x80) 2025-12-04T12:47:55.3698001Z Queue Min Size: 64(0x40) 2025-12-04T12:47:55.3698157Z Queue Max Size: 131072(0x20000) 2025-12-04T12:47:55.3698311Z Queue Type: MULTI 2025-12-04T12:47:55.3698465Z Node: 3 2025-12-04T12:47:55.3698615Z Device Type: GPU 2025-12-04T12:47:55.3698756Z Cache Info: 2025-12-04T12:47:55.3698881Z L1: 32(0x20) KB 2025-12-04T12:47:55.3699022Z L2: 4096(0x1000) KB 2025-12-04T12:47:55.3699158Z L3: 262144(0x40000) KB 2025-12-04T12:47:55.3699299Z Chip ID: 29861(0x74a5) 2025-12-04T12:47:55.3699453Z ASIC Revision: 1(0x1) 2025-12-04T12:47:55.3699612Z Cacheline Size: 128(0x80) 2025-12-04T12:47:55.3699774Z Max Clock 
Freq. (MHz): 2100 2025-12-04T12:47:55.3699930Z BDFID: 1280 2025-12-04T12:47:55.3700090Z Internal Node ID: 3 2025-12-04T12:47:55.3700260Z Compute Unit: 304 2025-12-04T12:47:55.3700420Z SIMDs per CU: 4 2025-12-04T12:47:55.3700585Z Shader Engines: 32 2025-12-04T12:47:55.3700753Z Shader Arrs. per Eng.: 1 2025-12-04T12:47:55.3700925Z WatchPts on Addr. Ranges:4 2025-12-04T12:47:55.3701097Z Coherent Host Access: FALSE 2025-12-04T12:47:55.3701253Z Memory Properties: 2025-12-04T12:47:55.3701386Z Features: KERNEL_DISPATCH 2025-12-04T12:47:55.3701542Z Fast F16 Operation: TRUE 2025-12-04T12:47:55.3701712Z Wavefront Size: 64(0x40) 2025-12-04T12:47:55.3701880Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3702031Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3702208Z x 1024(0x400) 2025-12-04T12:47:55.3702346Z y 1024(0x400) 2025-12-04T12:47:55.3702479Z z 1024(0x400) 2025-12-04T12:47:55.3702625Z Max Waves Per CU: 32(0x20) 2025-12-04T12:47:55.3702786Z Max Work-item Per CU: 2048(0x800) 2025-12-04T12:47:55.3702947Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3703089Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3703214Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3703350Z y 65535(0xffff) 2025-12-04T12:47:55.3703529Z z 65535(0xffff) 2025-12-04T12:47:55.3703683Z Max fbarriers/Workgrp: 32 2025-12-04T12:47:55.3703868Z Packet Processor uCode:: 185 2025-12-04T12:47:55.3704038Z SDMA engine uCode:: 24 2025-12-04T12:47:55.3704205Z IOMMU Support:: None 2025-12-04T12:47:55.3704352Z Pool Info: 2025-12-04T12:47:55.3704472Z Pool 1 2025-12-04T12:47:55.3704619Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:47:55.3704785Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3704949Z Allocatable: TRUE 2025-12-04T12:47:55.3705118Z Alloc Granule: 4KB 2025-12-04T12:47:55.3705303Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3705480Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3705652Z Accessible by all: FALSE 2025-12-04T12:47:55.3705808Z Pool 2 
2025-12-04T12:47:55.3705952Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:47:55.3706109Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3706268Z Allocatable: TRUE 2025-12-04T12:47:55.3706435Z Alloc Granule: 4KB 2025-12-04T12:47:55.3706607Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3706779Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3706951Z Accessible by all: FALSE 2025-12-04T12:47:55.3707106Z Pool 3 2025-12-04T12:47:55.3707246Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:47:55.3707406Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3707609Z Allocatable: TRUE 2025-12-04T12:47:55.3707776Z Alloc Granule: 4KB 2025-12-04T12:47:55.3707949Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3708124Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3708295Z Accessible by all: FALSE 2025-12-04T12:47:55.3708446Z Pool 4 2025-12-04T12:47:55.3708583Z Segment: GROUP 2025-12-04T12:47:55.3708736Z Size: 64(0x40) KB 2025-12-04T12:47:55.3708895Z Allocatable: FALSE 2025-12-04T12:47:55.3709066Z Alloc Granule: 0KB 2025-12-04T12:47:55.3709241Z Alloc Recommended Granule:0KB 2025-12-04T12:47:55.3709451Z Alloc Alignment: 0KB 2025-12-04T12:47:55.3709620Z Accessible by all: FALSE 2025-12-04T12:47:55.3709770Z ISA Info: 2025-12-04T12:47:55.3709887Z ISA 1 2025-12-04T12:47:55.3710030Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T12:47:55.3710208Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:47:55.3710380Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:47:55.3710550Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3710724Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3710926Z Fast f16: TRUE 2025-12-04T12:47:55.3711089Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3711247Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3711386Z x 1024(0x400) 2025-12-04T12:47:55.3711523Z y 1024(0x400) 2025-12-04T12:47:55.3711657Z z 1024(0x400) 2025-12-04T12:47:55.3711802Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3711949Z Grid Max Size 
per Dimension: 2025-12-04T12:47:55.3712077Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3712211Z y 65535(0xffff) 2025-12-04T12:47:55.3712346Z z 65535(0xffff) 2025-12-04T12:47:55.3712500Z FBarrier Max Size: 32 2025-12-04T12:47:55.3712639Z ISA 2 2025-12-04T12:47:55.3712783Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2025-12-04T12:47:55.3712969Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:47:55.3713136Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:47:55.3713299Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3713466Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3713623Z Fast f16: TRUE 2025-12-04T12:47:55.3713779Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3713928Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3714059Z x 1024(0x400) 2025-12-04T12:47:55.3714198Z y 1024(0x400) 2025-12-04T12:47:55.3714330Z z 1024(0x400) 2025-12-04T12:47:55.3714474Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3714616Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3714742Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3714904Z y 65535(0xffff) 2025-12-04T12:47:55.3715039Z z 65535(0xffff) 2025-12-04T12:47:55.3715186Z FBarrier Max Size: 32 2025-12-04T12:47:55.3715323Z ******* 2025-12-04T12:47:55.3715427Z Agent 5 2025-12-04T12:47:55.3715528Z ******* 2025-12-04T12:47:55.3715654Z Name: gfx942 2025-12-04T12:47:55.3715802Z Uuid: GPU-21bc7b5a7907f984 2025-12-04T12:47:55.3715957Z Marketing Name: 2025-12-04T12:47:55.3716113Z Vendor Name: AMD 2025-12-04T12:47:55.3716298Z Feature: KERNEL_DISPATCH 2025-12-04T12:47:55.3716451Z Profile: BASE_PROFILE 2025-12-04T12:47:55.3716606Z Float Round Mode: NEAR 2025-12-04T12:47:55.3716761Z Max Queue Number: 128(0x80) 2025-12-04T12:47:55.3716913Z Queue Min Size: 64(0x40) 2025-12-04T12:47:55.3717062Z Queue Max Size: 131072(0x20000) 2025-12-04T12:47:55.3717210Z Queue Type: MULTI 2025-12-04T12:47:55.3717354Z Node: 4 2025-12-04T12:47:55.3717543Z Device Type: GPU 2025-12-04T12:47:55.3717717Z Cache Info: 
2025-12-04T12:47:55.3717833Z L1: 32(0x20) KB 2025-12-04T12:47:55.3717967Z L2: 4096(0x1000) KB 2025-12-04T12:47:55.3718101Z L3: 262144(0x40000) KB 2025-12-04T12:47:55.3718237Z Chip ID: 29861(0x74a5) 2025-12-04T12:47:55.3718388Z ASIC Revision: 1(0x1) 2025-12-04T12:47:55.3718546Z Cacheline Size: 128(0x80) 2025-12-04T12:47:55.3718707Z Max Clock Freq. (MHz): 2100 2025-12-04T12:47:55.3718860Z BDFID: 25856 2025-12-04T12:47:55.3719009Z Internal Node ID: 4 2025-12-04T12:47:55.3719165Z Compute Unit: 304 2025-12-04T12:47:55.3719323Z SIMDs per CU: 4 2025-12-04T12:47:55.3719478Z Shader Engines: 32 2025-12-04T12:47:55.3719639Z Shader Arrs. per Eng.: 1 2025-12-04T12:47:55.3719808Z WatchPts on Addr. Ranges:4 2025-12-04T12:47:55.3719973Z Coherent Host Access: FALSE 2025-12-04T12:47:55.3720123Z Memory Properties: 2025-12-04T12:47:55.3720251Z Features: KERNEL_DISPATCH 2025-12-04T12:47:55.3720401Z Fast F16 Operation: TRUE 2025-12-04T12:47:55.3720567Z Wavefront Size: 64(0x40) 2025-12-04T12:47:55.3720731Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3720882Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3721015Z x 1024(0x400) 2025-12-04T12:47:55.3721151Z y 1024(0x400) 2025-12-04T12:47:55.3721282Z z 1024(0x400) 2025-12-04T12:47:55.3721426Z Max Waves Per CU: 32(0x20) 2025-12-04T12:47:55.3721596Z Max Work-item Per CU: 2048(0x800) 2025-12-04T12:47:55.3721758Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3721899Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3722020Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3722152Z y 65535(0xffff) 2025-12-04T12:47:55.3722283Z z 65535(0xffff) 2025-12-04T12:47:55.3722435Z Max fbarriers/Workgrp: 32 2025-12-04T12:47:55.3722604Z Packet Processor uCode:: 185 2025-12-04T12:47:55.3722767Z SDMA engine uCode:: 24 2025-12-04T12:47:55.3722927Z IOMMU Support:: None 2025-12-04T12:47:55.3723066Z Pool Info: 2025-12-04T12:47:55.3723177Z Pool 1 2025-12-04T12:47:55.3723407Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:47:55.3723564Z 
Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3723716Z Allocatable: TRUE 2025-12-04T12:47:55.3723876Z Alloc Granule: 4KB 2025-12-04T12:47:55.3724043Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3724212Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3724376Z Accessible by all: FALSE 2025-12-04T12:47:55.3724518Z Pool 2 2025-12-04T12:47:55.3724652Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:47:55.3724836Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3724986Z Allocatable: TRUE 2025-12-04T12:47:55.3725149Z Alloc Granule: 4KB 2025-12-04T12:47:55.3725314Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3725480Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3725642Z Accessible by all: FALSE 2025-12-04T12:47:55.3725785Z Pool 3 2025-12-04T12:47:55.3725917Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:47:55.3726067Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3726216Z Allocatable: TRUE 2025-12-04T12:47:55.3726374Z Alloc Granule: 4KB 2025-12-04T12:47:55.3726544Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3726709Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3726874Z Accessible by all: FALSE 2025-12-04T12:47:55.3727013Z Pool 4 2025-12-04T12:47:55.3727139Z Segment: GROUP 2025-12-04T12:47:55.3727284Z Size: 64(0x40) KB 2025-12-04T12:47:55.3727431Z Allocatable: FALSE 2025-12-04T12:47:55.3727629Z Alloc Granule: 0KB 2025-12-04T12:47:55.3727795Z Alloc Recommended Granule:0KB 2025-12-04T12:47:55.3727962Z Alloc Alignment: 0KB 2025-12-04T12:47:55.3728123Z Accessible by all: FALSE 2025-12-04T12:47:55.3728268Z ISA Info: 2025-12-04T12:47:55.3728375Z ISA 1 2025-12-04T12:47:55.3728512Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T12:47:55.3728683Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:47:55.3728848Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:47:55.3729010Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3729175Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3729329Z Fast f16: TRUE 2025-12-04T12:47:55.3729480Z 
Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3729627Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3729760Z x 1024(0x400) 2025-12-04T12:47:55.3729896Z y 1024(0x400) 2025-12-04T12:47:55.3730028Z z 1024(0x400) 2025-12-04T12:47:55.3730171Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3730360Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3730487Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3730622Z y 65535(0xffff) 2025-12-04T12:47:55.3730755Z z 65535(0xffff) 2025-12-04T12:47:55.3730904Z FBarrier Max Size: 32 2025-12-04T12:47:55.3731041Z ISA 2 2025-12-04T12:47:55.3731186Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2025-12-04T12:47:55.3731363Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:47:55.3731560Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:47:55.3731723Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3731888Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3732050Z Fast f16: TRUE 2025-12-04T12:47:55.3732206Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3732355Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3732486Z x 1024(0x400) 2025-12-04T12:47:55.3732622Z y 1024(0x400) 2025-12-04T12:47:55.3732753Z z 1024(0x400) 2025-12-04T12:47:55.3732897Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3733037Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3733164Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3733303Z y 65535(0xffff) 2025-12-04T12:47:55.3733437Z z 65535(0xffff) 2025-12-04T12:47:55.3733590Z FBarrier Max Size: 32 2025-12-04T12:47:55.3733727Z ******* 2025-12-04T12:47:55.3733832Z Agent 6 2025-12-04T12:47:55.3733934Z ******* 2025-12-04T12:47:55.3734051Z Name: gfx942 2025-12-04T12:47:55.3734199Z Uuid: GPU-992528b4a4dce35a 2025-12-04T12:47:55.3734354Z Marketing Name: 2025-12-04T12:47:55.3734509Z Vendor Name: AMD 2025-12-04T12:47:55.3734663Z Feature: KERNEL_DISPATCH 2025-12-04T12:47:55.3734818Z Profile: BASE_PROFILE 2025-12-04T12:47:55.3734979Z Float Round Mode: NEAR 
2025-12-04T12:47:55.3735137Z Max Queue Number: 128(0x80) 2025-12-04T12:47:55.3735296Z Queue Min Size: 64(0x40) 2025-12-04T12:47:55.3735450Z Queue Max Size: 131072(0x20000) 2025-12-04T12:47:55.3735606Z Queue Type: MULTI 2025-12-04T12:47:55.3735754Z Node: 5 2025-12-04T12:47:55.3735899Z Device Type: GPU 2025-12-04T12:47:55.3736036Z Cache Info: 2025-12-04T12:47:55.3736155Z L1: 32(0x20) KB 2025-12-04T12:47:55.3736290Z L2: 4096(0x1000) KB 2025-12-04T12:47:55.3736422Z L3: 262144(0x40000) KB 2025-12-04T12:47:55.3736562Z Chip ID: 29861(0x74a5) 2025-12-04T12:47:55.3736724Z ASIC Revision: 1(0x1) 2025-12-04T12:47:55.3736889Z Cacheline Size: 128(0x80) 2025-12-04T12:47:55.3737090Z Max Clock Freq. (MHz): 2100 2025-12-04T12:47:55.3737249Z BDFID: 5376 2025-12-04T12:47:55.3737408Z Internal Node ID: 5 2025-12-04T12:47:55.3737618Z Compute Unit: 304 2025-12-04T12:47:55.3737779Z SIMDs per CU: 4 2025-12-04T12:47:55.3737943Z Shader Engines: 32 2025-12-04T12:47:55.3738111Z Shader Arrs. per Eng.: 1 2025-12-04T12:47:55.3738282Z WatchPts on Addr. 
Ranges:4 2025-12-04T12:47:55.3738493Z Coherent Host Access: FALSE 2025-12-04T12:47:55.3738647Z Memory Properties: 2025-12-04T12:47:55.3738776Z Features: KERNEL_DISPATCH 2025-12-04T12:47:55.3738934Z Fast F16 Operation: TRUE 2025-12-04T12:47:55.3739103Z Wavefront Size: 64(0x40) 2025-12-04T12:47:55.3739272Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3739427Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3739566Z x 1024(0x400) 2025-12-04T12:47:55.3739706Z y 1024(0x400) 2025-12-04T12:47:55.3739845Z z 1024(0x400) 2025-12-04T12:47:55.3739997Z Max Waves Per CU: 32(0x20) 2025-12-04T12:47:55.3740164Z Max Work-item Per CU: 2048(0x800) 2025-12-04T12:47:55.3740334Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3740484Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3740614Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3740761Z y 65535(0xffff) 2025-12-04T12:47:55.3740901Z z 65535(0xffff) 2025-12-04T12:47:55.3741060Z Max fbarriers/Workgrp: 32 2025-12-04T12:47:55.3741237Z Packet Processor uCode:: 185 2025-12-04T12:47:55.3741410Z SDMA engine uCode:: 24 2025-12-04T12:47:55.3741578Z IOMMU Support:: None 2025-12-04T12:47:55.3741727Z Pool Info: 2025-12-04T12:47:55.3741845Z Pool 1 2025-12-04T12:47:55.3741989Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T12:47:55.3742156Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3742317Z Allocatable: TRUE 2025-12-04T12:47:55.3742483Z Alloc Granule: 4KB 2025-12-04T12:47:55.3742661Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3742835Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3743006Z Accessible by all: FALSE 2025-12-04T12:47:55.3743155Z Pool 2 2025-12-04T12:47:55.3743300Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T12:47:55.3743462Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3743621Z Allocatable: TRUE 2025-12-04T12:47:55.3743788Z Alloc Granule: 4KB 2025-12-04T12:47:55.3743971Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3744146Z Alloc Alignment: 4KB 
2025-12-04T12:47:55.3744316Z Accessible by all: FALSE 2025-12-04T12:47:55.3744508Z Pool 3 2025-12-04T12:47:55.3744649Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T12:47:55.3744809Z Size: 268419072(0xfffc000) KB 2025-12-04T12:47:55.3744967Z Allocatable: TRUE 2025-12-04T12:47:55.3745133Z Alloc Granule: 4KB 2025-12-04T12:47:55.3745307Z Alloc Recommended Granule:2048KB 2025-12-04T12:47:55.3745482Z Alloc Alignment: 4KB 2025-12-04T12:47:55.3745651Z Accessible by all: FALSE 2025-12-04T12:47:55.3745827Z Pool 4 2025-12-04T12:47:55.3745962Z Segment: GROUP 2025-12-04T12:47:55.3746116Z Size: 64(0x40) KB 2025-12-04T12:47:55.3746278Z Allocatable: FALSE 2025-12-04T12:47:55.3746446Z Alloc Granule: 0KB 2025-12-04T12:47:55.3746620Z Alloc Recommended Granule:0KB 2025-12-04T12:47:55.3746794Z Alloc Alignment: 0KB 2025-12-04T12:47:55.3746963Z Accessible by all: FALSE 2025-12-04T12:47:55.3747113Z ISA Info: 2025-12-04T12:47:55.3747230Z ISA 1 2025-12-04T12:47:55.3747373Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T12:47:55.3747591Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:47:55.3747768Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:47:55.3747940Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3748121Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3748286Z Fast f16: TRUE 2025-12-04T12:47:55.3748449Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3748606Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3748747Z x 1024(0x400) 2025-12-04T12:47:55.3748889Z y 1024(0x400) 2025-12-04T12:47:55.3749030Z z 1024(0x400) 2025-12-04T12:47:55.3749183Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3749333Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3749469Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3749612Z y 65535(0xffff) 2025-12-04T12:47:55.3749752Z z 65535(0xffff) 2025-12-04T12:47:55.3749909Z FBarrier Max Size: 32 2025-12-04T12:47:55.3750057Z ISA 2 2025-12-04T12:47:55.3750210Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 
2025-12-04T12:47:55.3750396Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T12:47:55.3750570Z Profiles: HSA_PROFILE_BASE 2025-12-04T12:47:55.3750741Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3750917Z Default Rounding Mode: NEAR 2025-12-04T12:47:55.3751081Z Fast f16: TRUE 2025-12-04T12:47:55.3751249Z Workgroup Max Size: 1024(0x400) 2025-12-04T12:47:55.3751405Z Workgroup Max Size per Dimension: 2025-12-04T12:47:55.3751546Z x 1024(0x400) 2025-12-04T12:47:55.3751719Z y 1024(0x400) 2025-12-04T12:47:55.3751854Z z 1024(0x400) 2025-12-04T12:47:55.3752001Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T12:47:55.3752146Z Grid Max Size per Dimension: 2025-12-04T12:47:55.3752275Z x 2147483647(0x7fffffff) 2025-12-04T12:47:55.3752413Z y 65535(0xffff) 2025-12-04T12:47:55.3752550Z z 65535(0xffff) 2025-12-04T12:47:55.3752702Z FBarrier Max Size: 32 2025-12-04T12:47:55.3752883Z *** Done *** 2025-12-04T12:47:55.3752995Z + rocminfo 2025-12-04T12:47:55.3753102Z + grep -E 'Name:.*\sgfx|Marketing' 2025-12-04T12:47:55.4590207Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:47:55.4590743Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T12:47:55.4591180Z Name: gfx942 2025-12-04T12:47:55.4591593Z Marketing Name: 2025-12-04T12:47:55.4591995Z Name: gfx942 2025-12-04T12:47:55.4592392Z Marketing Name: 2025-12-04T12:47:55.4592789Z Name: gfx942 2025-12-04T12:47:55.4593188Z Marketing Name: 2025-12-04T12:47:55.4593584Z Name: gfx942 2025-12-04T12:47:55.4593980Z Marketing Name: 2025-12-04T12:47:55.4694348Z + MAYBE_ROCM=rocm/ 2025-12-04T12:47:55.4694700Z + [[ linux-jammy-rocm-py3.10 == *xpu* ]] 2025-12-04T12:47:55.4695010Z + [[ linux-jammy-rocm-py3.10 != *-bazel-* ]] 2025-12-04T12:47:55.4695273Z + pip_install ninja==1.10.2 2025-12-04T12:47:55.4695590Z + pip_install_pkg='python3 -m pip install --progress-bar off' 2025-12-04T12:47:55.4695951Z + python3 -m pip install --progress-bar off ninja==1.10.2 2025-12-04T12:47:55.6915886Z Collecting ninja==1.10.2 
2025-12-04T12:47:55.7502611Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (5.0 kB) 2025-12-04T12:47:55.7673989Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB) 2025-12-04T12:47:55.9520631Z Installing collected packages: ninja 2025-12-04T12:47:55.9520933Z Attempting uninstall: ninja 2025-12-04T12:47:55.9524359Z Found existing installation: ninja 1.11.1.4 2025-12-04T12:47:55.9534859Z Uninstalling ninja-1.11.1.4: 2025-12-04T12:47:55.9561587Z Successfully uninstalled ninja-1.11.1.4 2025-12-04T12:47:55.9667797Z Successfully installed ninja-1.10.2 2025-12-04T12:47:56.0036722Z + export PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T12:47:56.0038234Z + PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T12:47:56.0039021Z + [[ linux-jammy-rocm-py3.10 == *aarch64* ]] 2025-12-04T12:47:56.0039294Z + [[ linux-jammy-rocm-py3.10 == *asan* ]] 2025-12-04T12:47:56.0039556Z + [[ linux-jammy-rocm-py3.10 == *-debug* ]] 2025-12-04T12:47:56.0039816Z + [[ linux-jammy-rocm-py3.10 != *-bazel-* ]] 2025-12-04T12:47:56.0040181Z + echo 'We are not in debug mode: linux-jammy-rocm-py3.10. Expect the assertion to pass' 2025-12-04T12:47:56.0040626Z We are not in debug mode: linux-jammy-rocm-py3.10. 
Expect the assertion to pass 2025-12-04T12:47:56.0040946Z + cd test 2025-12-04T12:47:56.0041446Z + python -c 'import torch; torch._C._crash_if_debug_asserts_fail(424242)' 2025-12-04T12:47:56.8030441Z + [[ distributed == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]] 2025-12-04T12:47:56.8030647Z + [[ distributed == \n\o\g\p\u\_\A\V\X\5\1\2 ]] 2025-12-04T12:47:56.8030808Z + [[ distributed == \l\e\g\a\c\y\_\n\v\i\d\i\a\_\d\r\i\v\e\r ]] 2025-12-04T12:47:56.8032896Z + DYNAMO_BENCHMARK_FLAGS=() 2025-12-04T12:47:56.8033473Z + [[ distributed == *pr_time_benchmarks* ]] 2025-12-04T12:47:56.8035213Z + [[ distributed == *dynamo_eager* ]] 2025-12-04T12:47:56.8035722Z + [[ distributed == *aot_eager* ]] 2025-12-04T12:47:56.8036081Z + [[ distributed == *aot_inductor* ]] 2025-12-04T12:47:56.8036460Z + [[ distributed == *max_autotune_inductor* ]] 2025-12-04T12:47:56.8036825Z + [[ distributed == *inductor* ]] 2025-12-04T12:47:56.8037669Z + [[ distributed == *dynamic* ]] 2025-12-04T12:47:56.8037999Z + [[ distributed == *cpu* ]] 2025-12-04T12:47:56.8038311Z + [[ distributed == *xpu* ]] 2025-12-04T12:47:56.8038662Z + DYNAMO_BENCHMARK_FLAGS+=(--device cuda) 2025-12-04T12:47:56.8050926Z + [[ linux-jammy-rocm-py3.10 == *libtorch* ]] 2025-12-04T12:47:56.8051193Z + [[ linux-jammy-rocm-py3.10 == *-bazel-* ]] 2025-12-04T12:47:56.8054213Z + cd test 2025-12-04T12:47:56.8054948Z + python -c 'import torch; print(torch.__config__.show())' 2025-12-04T12:47:57.5355113Z PyTorch built with: 2025-12-04T12:47:57.5355449Z - GCC 11.4 2025-12-04T12:47:57.5355662Z - C++ Version: 201703 2025-12-04T12:47:57.5356158Z - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T12:47:57.5356729Z - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T12:47:57.5357093Z - OpenMP 201511 (a.k.a. 
OpenMP 4.5) 2025-12-04T12:47:57.5357425Z - LAPACK is enabled (usually provided by MKL) 2025-12-04T12:47:57.5357848Z - NNPACK is enabled 2025-12-04T12:47:57.5358077Z - CPU capability usage: AVX512 2025-12-04T12:47:57.5358327Z - HIP Runtime 7.1.25424 2025-12-04T12:47:57.5358585Z - MIOpen 3.5.1 2025-12-04T12:47:57.5358784Z - Magma 2.9.0 2025-12-04T12:47:57.7910492Z - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32, CXX_COMPILER=/opt/cache/bin/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_FBGEMM_GENAI -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.10.0, USE_CUDA=OFF, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=ON, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF, 2025-12-04T12:47:57.7913965Z 2025-12-04T12:47:57.7914052Z + cd test 2025-12-04T12:47:57.7914316Z + python -c 'import torch; print(torch.__config__.parallel_info())' 2025-12-04T12:47:58.4084686Z ATen/Parallel: 2025-12-04T12:47:58.4085069Z at::get_num_threads() : 128 2025-12-04T12:47:58.4085348Z at::get_num_interop_threads() : 128 2025-12-04T12:47:58.4085617Z OpenMP 201511 (a.k.a. 
OpenMP 4.5) 2025-12-04T12:47:58.4085868Z omp_get_max_threads() : 128 2025-12-04T12:47:58.4086326Z Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T12:47:58.4086819Z mkl_get_max_threads() : 128 2025-12-04T12:47:58.4087139Z Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T12:47:58.4087983Z std::thread::hardware_concurrency() : 128 2025-12-04T12:47:58.4088234Z Environment variables: 2025-12-04T12:47:58.4088442Z OMP_NUM_THREADS : [not set] 2025-12-04T12:47:58.4088653Z MKL_NUM_THREADS : [not set] 2025-12-04T12:47:58.4088871Z ATen parallel backend: OpenMP 2025-12-04T12:47:58.4089017Z 2025-12-04T12:47:58.6001511Z + [[ distributed == *numpy_2* ]] 2025-12-04T12:47:58.6001752Z + [[ linux-jammy-rocm-py3.10 == *aarch64* ]] 2025-12-04T12:47:58.6001902Z + [[ distributed == *backward* ]] 2025-12-04T12:47:58.6002047Z + [[ distributed == *libtorch_agnostic_targetting* ]] 2025-12-04T12:47:58.6002195Z + [[ distributed == *xla* ]] 2025-12-04T12:47:58.6002317Z + [[ distributed == *vllm* ]] 2025-12-04T12:47:58.6002438Z + [[ distributed == *executorch* ]] 2025-12-04T12:47:58.6002911Z + [[ distributed == \j\i\t\_\l\e\g\a\c\y ]] 2025-12-04T12:47:58.6003045Z + [[ distributed == \q\u\a\n\t\i\z\a\t\i\o\n ]] 2025-12-04T12:47:58.6003189Z + [[ linux-jammy-rocm-py3.10 == *libtorch* ]] 2025-12-04T12:47:58.6003320Z + [[ distributed == distributed ]] 2025-12-04T12:47:58.6003448Z + test_distributed 2025-12-04T12:47:58.6003560Z + echo 'Testing distributed python tests' 2025-12-04T12:47:58.6003690Z Testing distributed python tests 2025-12-04T12:47:58.6003857Z + python test/run_test.py --distributed-tests --shard 3 3 --verbose 2025-12-04T12:48:00.3096860Z Excluding distributed/rpc/test_faulty_agent on ROCm 2025-12-04T12:48:00.3097371Z Excluding distributed/rpc/test_tensorpipe_agent on ROCm 2025-12-04T12:48:00.3097846Z Excluding distributed/rpc/test_share_memory on ROCm 
2025-12-04T12:48:00.3098255Z Excluding distributed/rpc/cuda/test_tensorpipe_agent on ROCm 2025-12-04T12:48:01.2961755Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/pytorch/test/.pytorch-disabled-tests.json 2025-12-04T12:48:01.6754578Z Ignoring disabled issues: [''] 2025-12-04T12:48:01.6801847Z Found test times from artifacts 2025-12-04T12:48:01.6965020Z Found test times from artifacts 2025-12-04T12:48:01.6968861Z Running all tests 2025-12-04T12:48:01.7035358Z Running parallel tests on 4 processes 2025-12-04T12:48:01.7037840Z Name: tests to run (est. time: 169.99min) 2025-12-04T12:48:01.7038169Z Serial tests (114): 2025-12-04T12:48:01.7038420Z distributed/test_dynamo_distributed 1/1 2025-12-04T12:48:01.7038700Z distributed/tensor/test_op_schema 1/1 2025-12-04T12:48:01.7038978Z distributed/checkpoint/test_nested_dict 1/1 2025-12-04T12:48:01.7039300Z distributed/checkpoint/test_consolidate_hf_safetensors 1/1 2025-12-04T12:48:01.7039637Z distributed/tensor/test_dtensor_compile 3/4 2025-12-04T12:48:01.7039957Z distributed/checkpoint/_experimental/test_barriers 1/1 2025-12-04T12:48:01.7040282Z distributed/pipelining/test_transformer 1/1 2025-12-04T12:48:01.7040612Z distributed/flight_recorder/test_fr_analysis 1/1 2025-12-04T12:48:01.7040905Z distributed/_composable/test_contract 1/1 2025-12-04T12:48:01.7041189Z distributed/checkpoint/test_dedup_tensors 1/1 2025-12-04T12:48:01.7041490Z distributed/test_c10d_functional_native 1/1 2025-12-04T12:48:01.7041779Z distributed/pipelining/test_backward 1/1 2025-12-04T12:48:01.7042068Z distributed/test_nvshmem_triton 1/1 2025-12-04T12:48:01.7042326Z distributed/tensor/test_dtensor 1/3 2025-12-04T12:48:01.7042579Z distributed/test_cupy_as_tensor 1/1 2025-12-04T12:48:01.7042833Z distributed/fsdp/test_fsdp_fx 1/1 2025-12-04T12:48:01.7043088Z distributed/_tools/test_sac_ilp 1/1 2025-12-04T12:48:01.7043354Z distributed/checkpoint/test_hf_storage 1/1 2025-12-04T12:48:01.7043641Z 
distributed/pipelining/test_microbatch 1/1 2025-12-04T12:48:01.7043914Z distributed/tensor/test_placement_types 1/1 2025-12-04T12:48:01.7044222Z distributed/tensor/test_dtensor_dispatch_overhead 1/1 2025-12-04T12:48:01.7044597Z distributed/checkpoint/_experimental/test_checkpoint_reader 1/1 2025-12-04T12:48:01.7044937Z distributed/checkpoint/test_format_utils 1/1 2025-12-04T12:48:01.7045242Z distributed/test_aten_comm_compute_reordering 1/3 2025-12-04T12:48:01.7045914Z distributed/test_p2p_ipc 1/1 2025-12-04T12:48:01.7046173Z distributed/tensor/test_common_rules 1/1 2025-12-04T12:48:01.7046473Z distributed/checkpoint/test_hf_safetensor_e2e 1/1 2025-12-04T12:48:01.7046774Z distributed/_tools/test_sac_estimator 1/1 2025-12-04T12:48:01.7047041Z distributed/_tools/test_memory_tracker 1/1 2025-12-04T12:48:01.7047343Z distributed/checkpoint/_experimental/test_builder 1/1 2025-12-04T12:48:01.7047727Z distributed/_composable/test_replicate_with_fsdp 1/1 2025-12-04T12:48:01.7048030Z distributed/tensor/test_xla_integration 1/1 2025-12-04T12:48:01.7048354Z distributed/checkpoint/_experimental/test_types 1/1 2025-12-04T12:48:01.7048673Z distributed/tensor/experimental/test_register_sharding 1/1 2025-12-04T12:48:01.7049060Z distributed/test_backends 1/1 2025-12-04T12:48:01.7049253Z distributed/tensor/test_experimental_ops 1/1 2025-12-04T12:48:01.7049490Z distributed/checkpoint/test_quantized_hf_storage 1/1 2025-12-04T12:48:01.7049782Z distributed/_composable/test_composability/test_pp_composability 1/1 2025-12-04T12:48:01.7050070Z distributed/checkpoint/test_async_process_executor 1/1 2025-12-04T12:48:01.7050307Z distributed/tensor/test_tensor_ops 1/4 2025-12-04T12:48:01.7050510Z distributed/tensor/test_tensor_ops 4/4 2025-12-04T12:48:01.7050723Z distributed/checkpoint/fsdp/test_fsdp_dsd 1/1 2025-12-04T12:48:01.7050946Z distributed/checkpoint/test_save_load_api 1/1 2025-12-04T12:48:01.7051183Z distributed/tensor/debug/test_comm_mode_features 1/1 2025-12-04T12:48:01.7051405Z 
distributed/checkpoint/test_traverse 1/1 2025-12-04T12:48:01.7051607Z distributed/tensor/test_random_ops 1/1 2025-12-04T12:48:01.7051850Z distributed/_composable/test_replicate_mixed_precision 1/1 2025-12-04T12:48:01.7052124Z distributed/_composable/fsdp/test_fully_shard_logging 1/1 2025-12-04T12:48:01.7052410Z distributed/_composable/fsdp/test_fully_shard_ignore_params 1/1 2025-12-04T12:48:01.7052684Z distributed/checkpoint/_experimental/test_staging 1/1 2025-12-04T12:48:01.7052964Z distributed/checkpoint/test_fsdp_tp_checkpoint_conversion 1/1 2025-12-04T12:48:01.7053218Z distributed/launcher/test_api 1/1 2025-12-04T12:48:01.7063009Z distributed/elastic/multiprocessing/test_api 1/1 2025-12-04T12:48:01.7063203Z distributed/fsdp/test_shard_utils 1/1 2025-12-04T12:48:01.7063388Z distributed/tensor/experimental/test_local_map 1/1 2025-12-04T12:48:01.7063568Z distributed/test_local_tensor 1/1 2025-12-04T12:48:01.7063755Z distributed/_composable/fsdp/test_fully_shard_state 1/1 2025-12-04T12:48:01.7063959Z distributed/checkpoint/test_tp_checkpoint 1/1 2025-12-04T12:48:01.7064140Z distributed/tensor/test_optimizers 1/1 2025-12-04T12:48:01.7064313Z distributed/test_symmetric_memory 1/1 2025-12-04T12:48:01.7064487Z distributed/_tools/test_runtime_estimator 1/1 2025-12-04T12:48:01.7064659Z distributed/fsdp/test_fsdp_memory 1/1 2025-12-04T12:48:01.7064828Z distributed/test_fake_pg 1/1 2025-12-04T12:48:01.7064990Z distributed/checkpoint/test_fsdp_model_state 1/1 2025-12-04T12:48:01.7065164Z distributed/fsdp/test_utils 1/1 2025-12-04T12:48:01.7065336Z distributed/tensor/parallel/test_tp_examples 1/1 2025-12-04T12:48:01.7065548Z distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_ 1/1 2025-12-04T12:48:01.7065755Z distributed/tensor/debug/test_comm_mode 1/1 2025-12-04T12:48:01.7065918Z distributed/test_dist2 1/1 2025-12-04T12:48:01.7066099Z distributed/_composable/fsdp/test_fully_shard_grad_scaler 1/1 2025-12-04T12:48:01.7066291Z distributed/launcher/test_run 1/1 
2025-12-04T12:48:01.7066460Z distributed/fsdp/test_fsdp_backward_prefetch 1/1 2025-12-04T12:48:01.7066639Z distributed/fsdp/test_fsdp_pure_fp16 1/1 2025-12-04T12:48:01.7066806Z distributed/checkpoint/test_checkpoint 1/1 2025-12-04T12:48:01.7066970Z distributed/fsdp/test_fsdp_apply 1/1 2025-12-04T12:48:01.7067227Z distributed/_composable/fsdp/test_fully_shard_frozen 1/1 2025-12-04T12:48:01.7067425Z distributed/checkpoint/test_hsdp_checkpoint 1/1 2025-12-04T12:48:01.7067674Z distributed/tensor/parallel/test_parallelize_api 1/1 2025-12-04T12:48:01.7067860Z distributed/fsdp/test_fsdp_state_dict 1/2 2025-12-04T12:48:01.7068043Z distributed/_composable/fsdp/test_fully_shard_init 1/1 2025-12-04T12:48:01.7068213Z distributed/fsdp/test_fsdp_flatten_params 1/1 2025-12-04T12:48:01.7068360Z distributed/test_distributed_spawn 3/7 2025-12-04T12:48:01.7068502Z distributed/test_distributed_spawn 6/7 2025-12-04T12:48:01.7068640Z distributed/test_serialization 1/1 2025-12-04T12:48:01.7068787Z distributed/fsdp/test_fsdp_multiple_wrapping 1/1 2025-12-04T12:48:01.7069004Z distributed/_composable/fsdp/test_fully_shard_comm 1/1 2025-12-04T12:48:01.7069174Z distributed/checkpoint/test_file_system_checkpoint 1/1 2025-12-04T12:48:01.7069330Z distributed/test_composability 1/1 2025-12-04T12:48:01.7069481Z distributed/fsdp/test_fsdp_dtensor_state_dict 1/1 2025-12-04T12:48:01.7069632Z distributed/fsdp/test_fsdp_comm_hooks 1/1 2025-12-04T12:48:01.7069769Z distributed/_shard/test_sharder 1/1 2025-12-04T12:48:01.7069922Z distributed/_shard/sharded_tensor/ops/test_tensor_ops 1/1 2025-12-04T12:48:01.7070089Z distributed/fsdp/test_fsdp_tp_integration 1/1 2025-12-04T12:48:01.7070254Z distributed/_shard/sharded_optim/test_sharded_optim 1/1 2025-12-04T12:48:01.7070437Z distributed/_composable/fsdp/test_fully_shard_state_dict 1/1 2025-12-04T12:48:01.7070600Z distributed/test_c10d_pypg 1/1 2025-12-04T12:48:01.7070731Z distributed/test_pg_wrapper 1/1 2025-12-04T12:48:01.7070885Z 
distributed/_shard/sharded_tensor/ops/test_binary_cmp 1/1 2025-12-04T12:48:01.7071055Z distributed/nn/jit/test_instantiator 1/1 2025-12-04T12:48:01.7071213Z distributed/_shard/sharding_spec/test_sharding_spec 1/1 2025-12-04T12:48:01.7071367Z distributed/test_nccl 1/1 2025-12-04T12:48:01.7071496Z distributed/fsdp/test_fsdp_misc 1/1 2025-12-04T12:48:01.7071633Z distributed/fsdp/test_fsdp_meta 1/1 2025-12-04T12:48:01.7071774Z distributed/fsdp/test_fsdp_unshard_params 1/1 2025-12-04T12:48:01.7071931Z distributed/checkpoint/test_state_dict_utils 1/1 2025-12-04T12:48:01.7072095Z distributed/_shard/sharded_tensor/ops/test_init 1/1 2025-12-04T12:48:01.7072267Z distributed/_shard/sharded_tensor/ops/test_embedding 1/1 2025-12-04T12:48:01.7072450Z distributed/_shard/sharded_tensor/ops/test_embedding_bag 1/1 2025-12-04T12:48:01.7072646Z distributed/_shard/sharded_tensor/test_sharded_tensor_reshard 1/1 2025-12-04T12:48:01.7072814Z distributed/fsdp/test_fsdp_core 2/2 2025-12-04T12:48:01.7072948Z distributed/test_c10d_ucc 1/1 2025-12-04T12:48:01.7073079Z distributed/test_c10d_common 1/1 2025-12-04T12:48:01.7073222Z distributed/fsdp/test_fsdp_mixed_precision 1/1 2025-12-04T12:48:01.7073367Z distributed/test_c10d_nccl 2/2 2025-12-04T12:48:01.7073493Z Parallel tests (0): 2025-12-04T12:48:01.7073611Z Name: excluded (est. time: 0.0min) 2025-12-04T12:48:01.7073733Z Serial tests (0): 2025-12-04T12:48:01.7073839Z Parallel tests (0): 2025-12-04T12:48:01.7074030Z Running distributed/test_dynamo_distributed 1/1 ... [2025-12-04 12:48:01.703968][2258236.64021791] 2025-12-04T12:48:01.7074241Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T12:48:01.7074655Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/test_dynamo_distributed.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 12:48:01.704175] 2025-12-04T13:44:25.3957907Z 2025-12-04T13:44:25.3958929Z PRINTING LOG FILE of distributed/test_dynamo_distributed 1/1 (test/test-reports/distributed.test_dynamo_distributed_1.1_47fb19e1d47c844a_.log) 2025-12-04T13:44:25.3962547Z Test results will be stored in test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T13:44:25.3963462Z ============================= test session starts ============================== 2025-12-04T13:44:25.3964079Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T13:44:25.3964726Z cachedir: .pytest_cache 2025-12-04T13:44:25.3965307Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T13:44:25.3965777Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T13:44:25.3966010Z configfile: pytest.ini 2025-12-04T13:44:25.3966454Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T13:44:25.3967115Z collecting ... 
collected 62 items 2025-12-04T13:44:25.3967399Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T13:44:25.3981565Z Running 62 items in this shard: test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_call_method_forward, test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_ddp_optimizer_inductor_strides_dont_specialize, test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_hf_bert_ddp_aot_eager, test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_hf_bert_ddp_inductor, test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_issue90375, test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_symbol_splitting, test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_unbacked_symbol_splitting_direct, test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_unbacked_symbol_splitting_indirect, test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_unbacked_symbol_splitting_no_binding, test/distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_unbacked_symbol_splitting_torture_multi, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation_with_fx_cache, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch, 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_optimizer_cudagraph, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_get_pg_attr, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_guard_collective, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing, 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_multiproc_autotune, test/distributed/test_dynamo_distributed.py::TestMultiProc::test_multiproc_autotune_dynamic_shapes, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_aot_autograd, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_async_subclass_no_specialize, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_compiled_flex_attention_full_model_ddp, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_compiled_flex_attention_local_ddp, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_custom_layer, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_ddp_baseline_aot_eager, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_ddp_baseline_inductor, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_empty_graph_inductor, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_dup_tensors_diff_source, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_dup_tensors_same_source, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_orig_params_assert, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_skip_guards, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_skip_register_attr_or_module, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_staticmethod, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_ctx_manager, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_inductor, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_inductor_layout_optimizations_inference, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_inductor_layout_optimizations_training, 
test/distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_inductor_transpose, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_higher_order_op, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_ignored_parameters, test/distributed/test_dynamo_distributed.py::TestSingleProc::test_no_split 2025-12-04T13:44:25.3991688Z 2025-12-04T13:44:25.3991881Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_call_method_forward PASSED [4.8360s] [ 1%] 2025-12-04T13:44:25.3992314Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_ddp_optimizer_inductor_strides_dont_specialize PASSED [0.5579s] [ 3%] 2025-12-04T13:44:25.3992786Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_hf_bert_ddp_aot_eager SKIPPED [0.0011s] (Unable to import transformers) [ 4%] 2025-12-04T13:44:25.3993278Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_hf_bert_ddp_inductor SKIPPED [0.0005s] (Unable to import transformers) [ 6%] 2025-12-04T13:44:25.3993690Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_issue90375 PASSED [0.0354s] [ 8%] 2025-12-04T13:44:25.3994058Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_symbol_splitting PASSED [0.5282s] [ 9%] 2025-12-04T13:44:25.3994456Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_unbacked_symbol_splitting_direct PASSED [0.7877s] [ 11%] 2025-12-04T13:44:25.3994926Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_unbacked_symbol_splitting_indirect PASSED [1.1448s] [ 12%] 2025-12-04T13:44:25.3995374Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_unbacked_symbol_splitting_no_binding PASSED [1.1477s] [ 14%] 2025-12-04T13:44:25.3995803Z distributed/test_dynamo_distributed.py::TestFakeDistributedSingleProc::test_unbacked_symbol_splitting_torture_multi PASSED [0.5324s] [ 
16%] 2025-12-04T13:44:25.3996307Z distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation I1204 12:48:17.443000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 4491 2025-12-04T13:44:25.3996792Z I1204 12:48:17.443000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 4492 2025-12-04T13:44:25.3997127Z I1204 12:48:17.444000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 4493 2025-12-04T13:44:25.3997461Z I1204 12:48:17.444000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 4494 2025-12-04T13:44:25.3997768Z PASSED [246.4500s] [ 17%] 2025-12-04T13:44:25.3998123Z distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation_with_fx_cache I1204 12:52:23.894000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 7161 2025-12-04T13:44:25.3998653Z I1204 12:52:23.894000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 7162 2025-12-04T13:44:25.3998980Z I1204 12:52:23.895000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 7163 2025-12-04T13:44:25.3999307Z I1204 12:52:23.895000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 7164 2025-12-04T13:44:25.3999655Z [rank1]:W1204 12:52:31.007000 7162 site-packages/torch/_inductor/compile_fx.py:1219] [0/0] Sleeping for 10 since sleep_sec_TESTING_ONLY is set 2025-12-04T13:44:25.4000018Z [rank3]:W1204 12:52:31.430000 7164 site-packages/torch/_inductor/compile_fx.py:1219] [0/0] Sleeping for 10 since sleep_sec_TESTING_ONLY is set 2025-12-04T13:44:25.4000372Z [rank0]:W1204 12:52:32.263000 7161 site-packages/torch/_inductor/compile_fx.py:1219] [0/0] Sleeping for 10 since sleep_sec_TESTING_ONLY is set 
2025-12-04T13:44:25.4001209Z [rank2]:W1204 12:52:32.299000 7163 site-packages/torch/_inductor/compile_fx.py:1219] [0/0] Sleeping for 10 since sleep_sec_TESTING_ONLY is set 2025-12-04T13:44:25.4001557Z [rank0]:W1204 12:53:58.370000 7161 site-packages/torch/_inductor/compile_fx.py:1219] [0/0] Sleeping for 10 since sleep_sec_TESTING_ONLY is set 2025-12-04T13:44:25.4001798Z PASSED [107.5434s] [ 19%] 2025-12-04T13:44:25.4002169Z distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar I1204 12:54:11.439000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 11154 2025-12-04T13:44:25.4002653Z I1204 12:54:11.439000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 11155 2025-12-04T13:44:25.4002987Z I1204 12:54:11.440000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 11156 2025-12-04T13:44:25.4003340Z I1204 12:54:11.440000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 11157 2025-12-04T13:44:25.4003568Z PASSED [38.9573s] [ 20%] 2025-12-04T13:44:25.4003998Z distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence I1204 12:54:50.398000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 14577 2025-12-04T13:44:25.4004502Z I1204 12:54:50.398000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 14578 2025-12-04T13:44:25.4004891Z I1204 12:54:50.399000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 14579 2025-12-04T13:44:25.4005223Z I1204 12:54:50.399000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 14580 2025-12-04T13:44:25.4005448Z PASSED [123.7618s] [ 22%] 2025-12-04T13:44:25.4005815Z 
distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor I1204 12:56:54.161000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 17884 2025-12-04T13:44:25.4006299Z I1204 12:56:54.162000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 17885 2025-12-04T13:44:25.4006689Z I1204 12:56:54.162000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 17886 2025-12-04T13:44:25.4007029Z I1204 12:56:54.163000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 17887 2025-12-04T13:44:25.4007655Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T13:44:25.4008131Z warnings.warn( 2025-12-04T13:44:25.4008563Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T13:44:25.4008997Z warnings.warn( 2025-12-04T13:44:25.4009414Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T13:44:25.4009856Z warnings.warn( 2025-12-04T13:44:25.4010279Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. 
Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T13:44:25.4010721Z warnings.warn( 2025-12-04T13:44:25.4010828Z PASSED [134.9891s] [ 24%] 2025-12-04T13:44:25.4011189Z distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch I1204 12:59:09.152000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 21267 2025-12-04T13:44:25.4011670Z I1204 12:59:09.152000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 21268 2025-12-04T13:44:25.4012055Z I1204 12:59:09.153000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 21269 2025-12-04T13:44:25.4012428Z I1204 12:59:09.153000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 21270 2025-12-04T13:44:25.4012681Z PASSED [139.7971s] [ 25%] 2025-12-04T13:44:25.4013075Z distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective I1204 13:01:28.950000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 24529 2025-12-04T13:44:25.4013583Z I1204 13:01:28.950000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 24530 2025-12-04T13:44:25.4013924Z I1204 13:01:28.951000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 24531 2025-12-04T13:44:25.4014286Z I1204 13:01:28.952000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 24532 2025-12-04T13:44:25.4014515Z woof 2025-12-04T13:44:25.4014604Z woof 2025-12-04T13:44:25.4014688Z woof 2025-12-04T13:44:25.4014820Z woof 2025-12-04T13:44:25.4014907Z woof 2025-12-04T13:44:25.4014992Z woof 2025-12-04T13:44:25.4015077Z woof 2025-12-04T13:44:25.4015161Z woof 2025-12-04T13:44:25.4015241Z woof 
2025-12-04T13:44:25.4015324Z woof 2025-12-04T13:44:25.4015406Z woof 2025-12-04T13:44:25.4015500Z woof 2025-12-04T13:44:25.4015603Z PASSED [137.7081s] [ 27%] 2025-12-04T13:44:25.4015999Z distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source I1204 13:03:46.659000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 27776 2025-12-04T13:44:25.4016464Z I1204 13:03:46.660000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 27777 2025-12-04T13:44:25.4016835Z I1204 13:03:46.660000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 27778 2025-12-04T13:44:25.4017178Z I1204 13:03:46.661000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 27779 2025-12-04T13:44:25.4017409Z PASSED [136.5047s] [ 29%] 2025-12-04T13:44:25.4017853Z distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source I1204 13:06:03.166000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 31055 2025-12-04T13:44:25.4018334Z I1204 13:06:03.167000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 31056 2025-12-04T13:44:25.4018692Z I1204 13:06:03.167000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 31057 2025-12-04T13:44:25.4019049Z I1204 13:06:03.168000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 31058 2025-12-04T13:44:25.4019282Z PASSED [122.6699s] [ 30%] 2025-12-04T13:44:25.4019900Z distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch SKIPPED [0.0005s] (Test is disabled because an issue exists disabling it: https://github.com/pytorch/pytorch/issues/165988 for platform(s) rocm. 
If you're seeing this on your local machine and would like to enable this test, please make sure CI is not set and you are not using the flag --import-disabled-tests.) [ 32%] 2025-12-04T13:44:25.4020748Z distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing I1204 13:08:05.839000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 34378 2025-12-04T13:44:25.4021213Z I1204 13:08:05.839000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 34379 2025-12-04T13:44:25.4021549Z I1204 13:08:05.840000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 34380 2025-12-04T13:44:25.4021908Z I1204 13:08:05.840000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 34381 2025-12-04T13:44:25.4022319Z [rank3]:W1204 13:09:12.444000 34381 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4022775Z [rank0]:W1204 13:09:26.384000 34378 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4023221Z [rank2]:W1204 13:09:28.590000 34380 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4023712Z [rank1]:W1204 13:09:40.591000 34379 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4024362Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T13:44:25.4024813Z warnings.warn( 2025-12-04T13:44:25.4025239Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T13:44:25.4025675Z warnings.warn( 2025-12-04T13:44:25.4026098Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T13:44:25.4026621Z warnings.warn( 2025-12-04T13:44:25.4026976Z [rank1]:[W1204 13:09:56.717681595 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=14, addr=[localhost]:34260, remote=[localhost]:6789): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 
2025-12-04T13:44:25.4027539Z Exception raised from recvBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first): 2025-12-04T13:44:25.4028042Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x77c08af85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4028525Z frame #1: + 0x6eb755f (0x77bfe515655f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4029153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1df (0x77bfe515210f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4029794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x77bfd512eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4030155Z frame #4: + 0xdc253 (0x77bf95c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4030413Z frame #5: + 0x94ac3 (0x77c0a43f2ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4030660Z frame #6: + 0x1268c0 (0x77c0a44848c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4030817Z 2025-12-04T13:44:25.4031191Z [rank1]:[W1204 13:09:56.719530985 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0(default_pg) Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 
2025-12-04T13:44:25.4031764Z [rank1]:[W1204 13:09:57.719629891 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:34260, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4032178Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4032666Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x77c08af85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4033127Z frame #1: + 0x6eb6c0e (0x77bfe5155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4033716Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x77bfe51520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4034380Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x77bfd512eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4034734Z frame #4: + 0xdc253 (0x77bf95c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4034984Z frame #5: + 0x94ac3 (0x77c0a43f2ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4035263Z frame #6: + 0x1268c0 (0x77c0a44848c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4035409Z 2025-12-04T13:44:25.4035672Z [rank1]:[W1204 13:09:57.721563359 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0(default_pg) Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4036178Z [rank1]:[W1204 13:09:58.721744533 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:34260, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4036573Z Exception raised from 
sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4037071Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x77c08af85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4037585Z frame #1: + 0x6eb6c0e (0x77bfe5155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4038172Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x77bfe51520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4038785Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x77bfd512eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4039140Z frame #4: + 0xdc253 (0x77bf95c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4039396Z frame #5: + 0x94ac3 (0x77c0a43f2ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4039651Z frame #6: + 0x1268c0 (0x77c0a44848c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4039804Z 2025-12-04T13:44:25.4040070Z [rank1]:[W1204 13:09:58.724224389 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0(default_pg) Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4040740Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T13:44:25.4041204Z warnings.warn( 2025-12-04T13:44:25.4041338Z PASSED [115.4603s] [ 33%] 2025-12-04T13:44:25.4041692Z distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess I1204 13:10:01.300000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 37645 2025-12-04T13:44:25.4042174Z I1204 13:10:01.301000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 37646 2025-12-04T13:44:25.4042507Z I1204 13:10:01.302000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 37647 2025-12-04T13:44:25.4042842Z I1204 13:10:01.302000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 37648 2025-12-04T13:44:25.4076443Z [rank0]:W1204 13:13:03.830000 37645 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4077081Z [rank3]:W1204 13:13:03.830000 37648 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4077615Z [rank1]:W1204 13:13:03.834000 37646 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4078069Z [rank2]:W1204 13:13:03.834000 37647 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4078364Z PASSED [194.5600s] [ 35%] 2025-12-04T13:44:25.4078761Z distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_optimizer_cudagraph I1204 13:13:15.862000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 40368 2025-12-04T13:44:25.4079252Z I1204 13:13:15.862000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 40369 2025-12-04T13:44:25.4079608Z I1204 13:13:15.863000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 
2 with pid 40370 2025-12-04T13:44:25.4079944Z I1204 13:13:15.863000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 40371 2025-12-04T13:44:25.4080344Z [rank2]:W1204 13:14:53.077000 40370 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4080802Z [rank3]:W1204 13:14:53.077000 40371 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4081254Z [rank0]:W1204 13:14:53.077000 40368 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4082174Z [rank1]:W1204 13:14:53.077000 40369 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4082459Z PASSED [104.7448s] [ 37%] 2025-12-04T13:44:25.4082807Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing I1204 13:15:00.608000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 44282 2025-12-04T13:44:25.4083278Z I1204 13:15:00.609000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 44283 2025-12-04T13:44:25.4083620Z I1204 13:15:00.609000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 44284 2025-12-04T13:44:25.4083964Z I1204 13:15:00.610000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 44285 2025-12-04T13:44:25.4084359Z [rank0]:W1204 13:16:26.983000 44282 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4084809Z [rank1]:W1204 13:16:46.079000 44283 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4085261Z [rank2]:W1204 13:16:57.381000 44284 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 
2025-12-04T13:44:25.4085709Z [rank3]:W1204 13:16:59.617000 44285 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4086379Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T13:44:25.4086849Z warnings.warn( 2025-12-04T13:44:25.4087297Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T13:44:25.4087882Z warnings.warn( 2025-12-04T13:44:25.4088336Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T13:44:25.4088772Z warnings.warn( 2025-12-04T13:44:25.4089254Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T13:44:25.4089706Z warnings.warn( 2025-12-04T13:44:25.4089811Z PASSED [135.5863s] [ 38%] 2025-12-04T13:44:25.4090146Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager I1204 13:17:16.196000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 47549 2025-12-04T13:44:25.4090613Z I1204 13:17:16.197000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 47550 2025-12-04T13:44:25.4090980Z I1204 13:17:16.197000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 47551 2025-12-04T13:44:25.4091339Z I1204 13:17:16.198000 703 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 47552 2025-12-04T13:44:25.4091761Z [rank2]:W1204 13:17:24.395000 47551 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4092223Z [rank3]:W1204 13:17:24.956000 47552 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4092678Z [rank0]:W1204 13:17:45.113000 47549 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4093127Z [rank1]:W1204 13:17:45.126000 47550 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4093438Z Exception in thread Thread-1 (_event_listener): 2025-12-04T13:44:25.4093597Z Traceback (most recent call last): 2025-12-04T13:44:25.4093806Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T13:44:25.4094026Z Exception in thread Thread-1 (_event_listener): 2025-12-04T13:44:25.4094167Z self.run() 2025-12-04T13:44:25.4094283Z Traceback (most recent call last): 2025-12-04T13:44:25.4094469Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T13:44:25.4095145Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T13:44:25.4095358Z Exception in thread Thread-1 (_event_listener): 2025-12-04T13:44:25.4095515Z self._target(*self._args, **self._kwargs) 2025-12-04T13:44:25.4095673Z Exception in thread Thread-1 (_event_listener): 2025-12-04T13:44:25.4095953Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 879, in _event_listener 2025-12-04T13:44:25.4096226Z Traceback (most recent call last): 2025-12-04T13:44:25.4096378Z Traceback (most recent call last): 2025-12-04T13:44:25.4096571Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T13:44:25.4096863Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T13:44:25.4097062Z event = parent_pipe.recv() 2025-12-04T13:44:25.4097264Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv 2025-12-04T13:44:25.4097538Z buf = self._recv_bytes() 2025-12-04T13:44:25.4097743Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes 2025-12-04T13:44:25.4097965Z buf = self._recv(4) 2025-12-04T13:44:25.4098163Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 383, in _recv 2025-12-04T13:44:25.4098388Z raise EOFError 2025-12-04T13:44:25.4098522Z EOFError 2025-12-04T13:44:25.4098641Z self.run() 2025-12-04T13:44:25.4098836Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T13:44:25.4099096Z self.run() 2025-12-04T13:44:25.4099232Z self._target(*self._args, **self._kwargs) 2025-12-04T13:44:25.4099442Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T13:44:25.4099808Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 879, in _event_listener 
2025-12-04T13:44:25.4100088Z self.run() 2025-12-04T13:44:25.4100258Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T13:44:25.4100451Z self._target(*self._args, **self._kwargs) 2025-12-04T13:44:25.4100768Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 879, in _event_listener 2025-12-04T13:44:25.4101038Z event = parent_pipe.recv() 2025-12-04T13:44:25.4101257Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv 2025-12-04T13:44:25.4101483Z self._target(*self._args, **self._kwargs) 2025-12-04T13:44:25.4101788Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 879, in _event_listener 2025-12-04T13:44:25.4102063Z buf = self._recv_bytes() 2025-12-04T13:44:25.4102309Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes 2025-12-04T13:44:25.4102547Z event = parent_pipe.recv() 2025-12-04T13:44:25.4102768Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv 2025-12-04T13:44:25.4102983Z buf = self._recv(4) 2025-12-04T13:44:25.4103182Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 383, in _recv 2025-12-04T13:44:25.4103411Z event = parent_pipe.recv() 2025-12-04T13:44:25.4103598Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv 2025-12-04T13:44:25.4103791Z buf = self._recv_bytes() 2025-12-04T13:44:25.4103992Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes 2025-12-04T13:44:25.4104188Z raise EOFError 2025-12-04T13:44:25.4104280Z EOFError 2025-12-04T13:44:25.4104375Z buf = self._recv_bytes() 2025-12-04T13:44:25.4104577Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes 2025-12-04T13:44:25.4104803Z 
buf = self._recv(4) 2025-12-04T13:44:25.4104982Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 383, in _recv 2025-12-04T13:44:25.4105174Z raise EOFError 2025-12-04T13:44:25.4105267Z EOFError 2025-12-04T13:44:25.4105362Z buf = self._recv(4) 2025-12-04T13:44:25.4105548Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 383, in _recv 2025-12-04T13:44:25.4105751Z raise EOFError 2025-12-04T13:44:25.4105854Z EOFError 2025-12-04T13:44:25.4105907Z 2025-12-04T13:44:25.4105911Z 2025-12-04T13:44:25.4106195Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml - 2025-12-04T13:44:25.4106605Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T13:44:25.4106882Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py:1036: KeyboardInterrupt 2025-12-04T13:44:25.4107168Z (to show a full traceback on KeyboardInterrupt use --full-trace) 2025-12-04T13:44:25.4107373Z ================== 21 passed, 3 skipped in 1793.96s (0:29:53) ================== 2025-12-04T13:44:25.4107622Z Command took >30min, returning 124 2025-12-04T13:44:25.4107747Z Got exit code 124 2025-12-04T13:44:25.4107853Z Retrying single test... 
2025-12-04T13:44:25.4108133Z Test results will be stored in test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-37dceaaeb6583f22.xml 2025-12-04T13:44:25.4108483Z ============================= test session starts ============================== 2025-12-04T13:44:25.4108709Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T13:44:25.4108915Z cachedir: .pytest_cache 2025-12-04T13:44:25.4109153Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T13:44:25.4109409Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T13:44:25.4109545Z configfile: pytest.ini 2025-12-04T13:44:25.4109788Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T13:44:25.4110072Z collecting ... collected 62 items / 61 deselected / 1 selected 2025-12-04T13:44:25.4110376Z stepcurrent: skipping 24 already run items. 
Running only test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager 2025-12-04T13:44:25.4110640Z Running 1 items in this shard 2025-12-04T13:44:25.4110721Z 2025-12-04T13:44:25.4110989Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager I1204 13:18:12.543000 50200 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 50856 2025-12-04T13:44:25.4111445Z I1204 13:18:12.543000 50200 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 50857 2025-12-04T13:44:25.4111794Z I1204 13:18:12.544000 50200 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 50858 2025-12-04T13:44:25.4112147Z I1204 13:18:12.545000 50200 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 50859 2025-12-04T13:44:25.4112494Z [W1204 13:18:18.754763113 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T13:44:25.4112854Z [W1204 13:18:18.754795222 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.4113060Z 2025-12-04T13:44:25.4113273Z [rank1]:W1204 13:18:22.643000 50857 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4113815Z [W1204 13:18:26.688538922 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4114018Z 2025-12-04T13:44:25.4114159Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] Caught exception: 2025-12-04T13:44:25.4114495Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] Traceback (most recent call last): 2025-12-04T13:44:25.4114982Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4115465Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] getattr(self, test_name)() 2025-12-04T13:44:25.4115996Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4116441Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] fn() 2025-12-04T13:44:25.4116882Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4117345Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] method(*args, **kwargs) 2025-12-04T13:44:25.4117822Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4118263Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwds) 2025-12-04T13:44:25.4118739Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4119215Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.4119678Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 681, in test_fsdp_aot_eager 2025-12-04T13:44:25.4120178Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] with _dynamo_dist_per_rank_init(self.rank, self.world_size): 2025-12-04T13:44:25.4120656Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135, in __enter__ 2025-12-04T13:44:25.4121067Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] return next(self.gen) 2025-12-04T13:44:25.4121556Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1597, in _dynamo_dist_per_rank_init 2025-12-04T13:44:25.4122113Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] c10d.init_process_group(backend=backend, rank=rank, world_size=world_size) 2025-12-04T13:44:25.4122622Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4123077Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.4123528Z E1204 13:18:26.757000 50856 
site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper 2025-12-04T13:44:25.4123995Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] func_return = func(*args, **kwargs) 2025-12-04T13:44:25.4124482Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in init_process_group 2025-12-04T13:44:25.4124988Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] store, rank, world_size = next(rendezvous_iterator) 2025-12-04T13:44:25.4125561Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 281, in _env_rendezvous_handler 2025-12-04T13:44:25.4126039Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] store = _create_c10d_store( 2025-12-04T13:44:25.4126505Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 200, in _create_c10d_store 2025-12-04T13:44:25.4126953Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] return TCPStore( 2025-12-04T13:44:25.4127588Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. 
port: 6789, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use 2025-12-04T13:44:25.4128375Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] Exception raised from makeWithPort at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:313 (most recent call first): 2025-12-04T13:44:25.4128905Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] C++ CapturedTraceback: 2025-12-04T13:44:25.4129590Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4130403Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4130928Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #6 c10d::detail::UvTcpServer::makeWithPort(uv_loop_s*, unsigned short, bool) from :0 2025-12-04T13:44:25.4131445Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #7 c10d::detail::LibUVStoreDaemon::init(c10d::TCPStoreOptions const&) [clone .cold] from TCPStoreLibUvBackend.cpp:0 2025-12-04T13:44:25.4131961Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #8 c10d::detail::create_libuv_tcpstore_backend(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4132459Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #9 c10d::detail::TCPServer::start(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4132984Z E1204 13:18:26.757000 50856 
site-packages/torch/testing/_internal/common_distributed.py:935] #10 c10d::TCPStore::TCPStore(std::__cxx11::basic_string, std::allocator >, c10d::TCPStoreOptions const&) from ??:0 2025-12-04T13:44:25.4137059Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #11 pybind11::cpp_function::initialize, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#50}, pybind11::detail::void_type (*)(), c10::intrusive_ptr > (std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::detail::void_type ()>::execute > >, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24]) &&::{lambda(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#1}, void, pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&&, void (*)(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, 
bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 2025-12-04T13:44:25.4141218Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #12 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.4141714Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #13 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.4142194Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #14 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4142696Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #15 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4143216Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #16 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4143683Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #17 slot_tp_init from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7737 2025-12-04T13:44:25.4144142Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #18 type_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:1135 2025-12-04T13:44:25.4144540Z E1204 13:18:26.757000 50856 
site-packages/torch/testing/_internal/common_distributed.py:935] #19 pybind11_meta_call from :0 2025-12-04T13:44:25.4144988Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #20 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4145528Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #21 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4146049Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #22 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4146566Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4147165Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4147738Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #25 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4148265Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4148788Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4149307Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #28 
PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4149783Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #29 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4150273Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4150770Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4151227Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4151709Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4152232Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4152754Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4153275Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #36 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4153820Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #37 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4154395Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4154959Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4155486Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4155969Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #41 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4156450Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4156993Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #43 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4157560Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4158043Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #45 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4158559Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 
2025-12-04T13:44:25.4159082Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4159632Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4160176Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4160691Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4161207Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4161725Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4162218Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #53 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4162676Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #54 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4163162Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4163686Z E1204 
13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4164206Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4164764Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #58 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4165289Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #59 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4165834Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #60 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4166360Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #61 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4166920Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4167819Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4168474Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #64 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 2025-12-04T13:44:25.4168974Z E1204 13:18:26.757000 
50856 site-packages/torch/testing/_internal/common_distributed.py:935] #65 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.4169514Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #66 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.4170037Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #67 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.4170600Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #68 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.4171139Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #69 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.4171686Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #70 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.4172179Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #71 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.4172720Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #72 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.4173172Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #73 _start from ??:0 2025-12-04T13:44:25.4173536Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] #74 from ??:0 2025-12-04T13:44:25.4173914Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] 
2025-12-04T13:44:25.4174225Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4174892Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4175440Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_aot_eager 2025-12-04T13:44:25.4175930Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4176354Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4176828Z E1204 13:18:26.757000 50856 site-packages/torch/testing/_internal/common_distributed.py:935] exiting process 0 with exit code: 10 2025-12-04T13:44:25.4177228Z FAILED [15.0257s] [100%] 2025-12-04T13:44:25.4177428Z 2025-12-04T13:44:25.4177565Z =================================== FAILURES =================================== 2025-12-04T13:44:25.4177782Z ______________________ TestMultiProc.test_fsdp_aot_eager _______________________ 2025-12-04T13:44:25.4178032Z Traceback (most recent call last): 2025-12-04T13:44:25.4178364Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T13:44:25.4178767Z self._join_processes(fn) 2025-12-04T13:44:25.4179149Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T13:44:25.4179463Z self._check_return_codes(fn, elapsed_time) 2025-12-04T13:44:25.4179780Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1079, in 
_check_return_codes 2025-12-04T13:44:25.4180120Z raise RuntimeError(error) 2025-12-04T13:44:25.4180317Z RuntimeError: Process 0 exited with error code 10 and exception: 2025-12-04T13:44:25.4180564Z Traceback (most recent call last): 2025-12-04T13:44:25.4180989Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4181289Z getattr(self, test_name)() 2025-12-04T13:44:25.4181563Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4181931Z fn() 2025-12-04T13:44:25.4182312Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4182591Z method(*args, **kwargs) 2025-12-04T13:44:25.4182790Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4183025Z return func(*args, **kwds) 2025-12-04T13:44:25.4183392Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4183681Z return func(*args, **kwargs) 2025-12-04T13:44:25.4183939Z File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 681, in test_fsdp_aot_eager 2025-12-04T13:44:25.4184276Z with _dynamo_dist_per_rank_init(self.rank, self.world_size): 2025-12-04T13:44:25.4184528Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135, in __enter__ 2025-12-04T13:44:25.4184852Z return next(self.gen) 2025-12-04T13:44:25.4185191Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1597, in _dynamo_dist_per_rank_init 2025-12-04T13:44:25.4185585Z c10d.init_process_group(backend=backend, rank=rank, world_size=world_size) 2025-12-04T13:44:25.4185905Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4186218Z return func(*args, **kwargs) 2025-12-04T13:44:25.4186485Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper 2025-12-04T13:44:25.4186747Z func_return = func(*args, **kwargs) 2025-12-04T13:44:25.4187205Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in init_process_group 2025-12-04T13:44:25.4187566Z store, rank, world_size = next(rendezvous_iterator) 2025-12-04T13:44:25.4187979Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 281, in _env_rendezvous_handler 2025-12-04T13:44:25.4188261Z store = _create_c10d_store( 2025-12-04T13:44:25.4188623Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 200, in _create_c10d_store 2025-12-04T13:44:25.4188953Z return TCPStore( 2025-12-04T13:44:25.4189304Z torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. 
port: 6789, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use 2025-12-04T13:44:25.4189880Z Exception raised from makeWithPort at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:313 (most recent call first): 2025-12-04T13:44:25.4190266Z C++ CapturedTraceback: 2025-12-04T13:44:25.4190789Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4191396Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4191717Z #6 c10d::detail::UvTcpServer::makeWithPort(uv_loop_s*, unsigned short, bool) from :0 2025-12-04T13:44:25.4192074Z #7 c10d::detail::LibUVStoreDaemon::init(c10d::TCPStoreOptions const&) [clone .cold] from TCPStoreLibUvBackend.cpp:0 2025-12-04T13:44:25.4192467Z #8 c10d::detail::create_libuv_tcpstore_backend(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4192742Z #9 c10d::detail::TCPServer::start(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4193061Z #10 c10d::TCPStore::TCPStore(std::__cxx11::basic_string, std::allocator >, c10d::TCPStoreOptions const&) from ??:0 2025-12-04T13:44:25.4196979Z #11 pybind11::cpp_function::initialize, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#50}, pybind11::detail::void_type (*)(), c10::intrusive_ptr > (std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::detail::void_type ()>::execute > >, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v 
const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24]) &&::{lambda(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#1}, void, pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&&, void (*)(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 2025-12-04T13:44:25.4201060Z #12 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.4201339Z #13 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.4201639Z #14 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4201947Z #15 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4202326Z #16 
PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4202603Z #17 slot_tp_init from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7737 2025-12-04T13:44:25.4202905Z #18 type_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:1135 2025-12-04T13:44:25.4203123Z #19 pybind11_meta_call from :0 2025-12-04T13:44:25.4203409Z #20 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4203757Z #21 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4204089Z #22 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4204408Z #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4204751Z #24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4205067Z #25 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4205376Z #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4295446Z #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4295749Z #28 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4296026Z #29 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4296308Z #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4296610Z #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4296857Z #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4297135Z #33 _PyEval_EvalFrame 
from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4297435Z #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4297764Z #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4298049Z #36 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4298409Z #37 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4298699Z #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4298978Z #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4299256Z #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4299498Z #41 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4299767Z #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4300013Z #43 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4300311Z #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4300559Z #45 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4300807Z #46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4301095Z #47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4301382Z #48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4301670Z #49 
_PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4301958Z #50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4302246Z #51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4302531Z #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4302774Z #53 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4302997Z #54 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4303239Z #55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4303515Z #56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4303789Z #57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4304085Z #58 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4304375Z #59 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4304687Z #60 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4304977Z #61 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4305273Z #62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4305587Z #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4305864Z #64 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 
2025-12-04T13:44:25.4306165Z #65 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.4306403Z #66 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.4306643Z #67 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.4306916Z #68 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.4307177Z #69 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.4307413Z #70 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.4307765Z #71 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.4307998Z #72 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.4308147Z #73 _start from ??:0 2025-12-04T13:44:25.4308263Z #74 from ??:0 2025-12-04T13:44:25.4308346Z 2025-12-04T13:44:25.4308353Z 2025-12-04T13:44:25.4308430Z To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4308681Z PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_aot_eager 2025-12-04T13:44:25.4308865Z 2025-12-04T13:44:25.4308957Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4309095Z 2025-12-04T13:44:25.4309136Z 2025-12-04T13:44:25.4309219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:44:25.4309432Z Process 0 terminated with exit code 10, terminating remaining processes. 
2025-12-04T13:44:25.4309828Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-37dceaaeb6583f22.xml - 2025-12-04T13:44:25.4310190Z =========================== short test summary info ============================ 2025-12-04T13:44:25.4310522Z FAILED [15.0257s] distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager - RuntimeError: Process 0 exited with error code 10 and exception: 2025-12-04T13:44:25.4310821Z Traceback (most recent call last): 2025-12-04T13:44:25.4311070Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4311318Z getattr(self, test_name)() 2025-12-04T13:44:25.4311570Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4311858Z fn() 2025-12-04T13:44:25.4312092Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4312343Z method(*args, **kwargs) 2025-12-04T13:44:25.4312517Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4312705Z return func(*args, **kwds) 2025-12-04T13:44:25.4312946Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4313190Z return func(*args, **kwargs) 2025-12-04T13:44:25.4313409Z File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 681, in test_fsdp_aot_eager 2025-12-04T13:44:25.4313665Z with _dynamo_dist_per_rank_init(self.rank, self.world_size): 2025-12-04T13:44:25.4313870Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135, in __enter__ 2025-12-04T13:44:25.4314052Z return next(self.gen) 2025-12-04T13:44:25.4314326Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1597, in _dynamo_dist_per_rank_init 2025-12-04T13:44:25.4314644Z c10d.init_process_group(backend=backend, rank=rank, world_size=world_size) 2025-12-04T13:44:25.4314914Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4315138Z return func(*args, **kwargs) 2025-12-04T13:44:25.4315349Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper 2025-12-04T13:44:25.4315579Z func_return = func(*args, **kwargs) 2025-12-04T13:44:25.4315840Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in init_process_group 2025-12-04T13:44:25.4316121Z store, rank, world_size = next(rendezvous_iterator) 2025-12-04T13:44:25.4316386Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 281, in _env_rendezvous_handler 2025-12-04T13:44:25.4316667Z store = _create_c10d_store( 2025-12-04T13:44:25.4316898Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 200, in _create_c10d_store 2025-12-04T13:44:25.4317130Z return TCPStore( 2025-12-04T13:44:25.4317438Z torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. 
port: 6789, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use 2025-12-04T13:44:25.4317997Z Exception raised from makeWithPort at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:313 (most recent call first): 2025-12-04T13:44:25.4318291Z C++ CapturedTraceback: 2025-12-04T13:44:25.4318743Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4319342Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4319625Z #6 c10d::detail::UvTcpServer::makeWithPort(uv_loop_s*, unsigned short, bool) from :0 2025-12-04T13:44:25.4319906Z #7 c10d::detail::LibUVStoreDaemon::init(c10d::TCPStoreOptions const&) [clone .cold] from TCPStoreLibUvBackend.cpp:0 2025-12-04T13:44:25.4320188Z #8 c10d::detail::create_libuv_tcpstore_backend(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4320411Z #9 c10d::detail::TCPServer::start(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4320702Z #10 c10d::TCPStore::TCPStore(std::__cxx11::basic_string, std::allocator >, c10d::TCPStoreOptions const&) from ??:0 2025-12-04T13:44:25.4324504Z #11 pybind11::cpp_function::initialize, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#50}, pybind11::detail::void_type (*)(), c10::intrusive_ptr > (std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::detail::void_type ()>::execute > >, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v 
const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24]) &&::{lambda(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#1}, void, pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&&, void (*)(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 2025-12-04T13:44:25.4328615Z #12 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.4328850Z #13 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.4329094Z #14 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4329365Z #15 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4329819Z #16 
PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4330051Z #17 slot_tp_init from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7737 2025-12-04T13:44:25.4330278Z #18 type_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:1135 2025-12-04T13:44:25.4330460Z #19 pybind11_meta_call from :0 2025-12-04T13:44:25.4330641Z #20 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4330905Z #21 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4331185Z #22 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4331468Z #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4331749Z #24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4332028Z #25 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4332313Z #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4332596Z #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4332868Z #28 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4333089Z #29 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4333332Z #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4333584Z #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4333802Z #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4334043Z #33 _PyEval_EvalFrame 
from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4334323Z #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4334610Z #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4334893Z #36 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4335180Z #37 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4335512Z #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4335792Z #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4336071Z #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4336312Z #41 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4336560Z #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4336806Z #43 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4337047Z #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4337318Z #45 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4337626Z #46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4337908Z #47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4338215Z #48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4338495Z #49 
_PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4338773Z #50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4339048Z #51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4339325Z #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4339578Z #53 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4339798Z #54 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4340048Z #55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4340331Z #56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4340604Z #57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4340915Z #58 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4341214Z #59 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4341492Z #60 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4341768Z #61 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4342043Z #62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4342318Z #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4342563Z #64 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 
2025-12-04T13:44:25.4342797Z #65 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.4343021Z #66 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.4343248Z #67 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.4343511Z #68 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.4343798Z #69 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.4344030Z #70 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.4344263Z #71 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.4344531Z #72 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.4344683Z #73 _start from ??:0 2025-12-04T13:44:25.4344813Z #74 from ??:0 2025-12-04T13:44:25.4344900Z 2025-12-04T13:44:25.4344902Z 2025-12-04T13:44:25.4344985Z To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4345248Z PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_aot_eager 2025-12-04T13:44:25.4345426Z 2025-12-04T13:44:25.4345521Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4345716Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T13:44:25.4345921Z ====================== 1 failed, 61 deselected in 15.04s ======================= 2025-12-04T13:44:25.4346068Z Got exit code 1 2025-12-04T13:44:25.4346171Z Retrying single test... 
2025-12-04T13:44:25.4346451Z Test results will be stored in test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-0634d46fd188d3db.xml 2025-12-04T13:44:25.4346753Z ============================= test session starts ============================== 2025-12-04T13:44:25.4346973Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T13:44:25.4347174Z cachedir: .pytest_cache 2025-12-04T13:44:25.4347419Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T13:44:25.4347745Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T13:44:25.4347863Z configfile: pytest.ini 2025-12-04T13:44:25.4348094Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T13:44:25.4348376Z collecting ... collected 62 items / 61 deselected / 1 selected 2025-12-04T13:44:25.4348669Z stepcurrent: skipping 24 already run items. 
Running only test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager 2025-12-04T13:44:25.4348930Z Running 1 items in this shard 2025-12-04T13:44:25.4349005Z 2025-12-04T13:44:25.4349269Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager I1204 13:18:33.532000 53497 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 54153 2025-12-04T13:44:25.4349717Z I1204 13:18:33.532000 53497 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 54154 2025-12-04T13:44:25.4350061Z I1204 13:18:33.533000 53497 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 54155 2025-12-04T13:44:25.4350410Z I1204 13:18:33.533000 53497 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 54156 2025-12-04T13:44:25.4350753Z [W1204 13:18:38.644634108 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T13:44:25.4351101Z [W1204 13:18:38.644657928 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.4351301Z 2025-12-04T13:44:25.4351514Z [rank3]:W1204 13:18:41.731000 54156 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4351970Z [rank2]:W1204 13:18:43.595000 54155 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4352373Z [W1204 13:18:46.366074253 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4352572Z 2025-12-04T13:44:25.4352712Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] Caught exception: 2025-12-04T13:44:25.4353083Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] Traceback (most recent call last): 2025-12-04T13:44:25.4353566Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4354036Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] getattr(self, test_name)() 2025-12-04T13:44:25.4354501Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4354961Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] fn() 2025-12-04T13:44:25.4355388Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4355839Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] method(*args, **kwargs) 2025-12-04T13:44:25.4356222Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4356612Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwds) 2025-12-04T13:44:25.4357080Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4357575Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.4358033Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 681, in test_fsdp_aot_eager 2025-12-04T13:44:25.4358533Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] with _dynamo_dist_per_rank_init(self.rank, self.world_size): 2025-12-04T13:44:25.4358984Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135, in __enter__ 2025-12-04T13:44:25.4359402Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] return next(self.gen) 2025-12-04T13:44:25.4359893Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1597, in _dynamo_dist_per_rank_init 2025-12-04T13:44:25.4360461Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] c10d.init_process_group(backend=backend, rank=rank, world_size=world_size) 2025-12-04T13:44:25.4360969Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4361417Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.4361865Z E1204 13:18:46.434000 54153 
site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper 2025-12-04T13:44:25.4362325Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] func_return = func(*args, **kwargs) 2025-12-04T13:44:25.4362856Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in init_process_group 2025-12-04T13:44:25.4363365Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] store, rank, world_size = next(rendezvous_iterator) 2025-12-04T13:44:25.4363868Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 281, in _env_rendezvous_handler 2025-12-04T13:44:25.4364344Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] store = _create_c10d_store( 2025-12-04T13:44:25.4364845Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 200, in _create_c10d_store 2025-12-04T13:44:25.4365297Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] return TCPStore( 2025-12-04T13:44:25.4365842Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. 
port: 6789, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use 2025-12-04T13:44:25.4366580Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] Exception raised from makeWithPort at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:313 (most recent call first): 2025-12-04T13:44:25.4367099Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] C++ CapturedTraceback: 2025-12-04T13:44:25.4367879Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4368684Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4369208Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #6 c10d::detail::UvTcpServer::makeWithPort(uv_loop_s*, unsigned short, bool) from :0 2025-12-04T13:44:25.4369728Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #7 c10d::detail::LibUVStoreDaemon::init(c10d::TCPStoreOptions const&) [clone .cold] from TCPStoreLibUvBackend.cpp:0 2025-12-04T13:44:25.4370241Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #8 c10d::detail::create_libuv_tcpstore_backend(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4370692Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #9 c10d::detail::TCPServer::start(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4371217Z E1204 13:18:46.434000 54153 
site-packages/torch/testing/_internal/common_distributed.py:935] #10 c10d::TCPStore::TCPStore(std::__cxx11::basic_string, std::allocator >, c10d::TCPStoreOptions const&) from ??:0 2025-12-04T13:44:25.4375278Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #11 pybind11::cpp_function::initialize, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#50}, pybind11::detail::void_type (*)(), c10::intrusive_ptr > (std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::detail::void_type ()>::execute > >, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24]) &&::{lambda(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#1}, void, pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&&, void (*)(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, 
bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 2025-12-04T13:44:25.4379347Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #12 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.4379814Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #13 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.4380291Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #14 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4380792Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #15 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4381298Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #16 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4381794Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #17 slot_tp_init from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7737 2025-12-04T13:44:25.4382254Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #18 type_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:1135 2025-12-04T13:44:25.4382657Z E1204 13:18:46.434000 54153 
site-packages/torch/testing/_internal/common_distributed.py:935] #19 pybind11_meta_call from :0 2025-12-04T13:44:25.4383064Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #20 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4383562Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #21 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4384110Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #22 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4384626Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4385140Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4385657Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #25 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4386181Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4386697Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4387184Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #28 
PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4387713Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #29 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4388197Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4388683Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4389137Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4389612Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4390127Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4390641Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4391188Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #36 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4391712Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #37 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4392227Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4392742Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4393257Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4393763Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #41 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4394242Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4394721Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #43 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4395197Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4395678Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #45 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4396156Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 
2025-12-04T13:44:25.4396674Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4397184Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4397729Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4398240Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4398749Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4399261Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4399739Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #53 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4400194Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #54 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4400715Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4401231Z E1204 
13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4401749Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4402265Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #58 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4402777Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #59 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4403319Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #60 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4403826Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #61 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4404333Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4404842Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4405321Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #64 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 2025-12-04T13:44:25.4405783Z E1204 13:18:46.434000 
54153 site-packages/torch/testing/_internal/common_distributed.py:935] #65 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.4406242Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #66 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.4406704Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #67 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.4407196Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #68 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.4407712Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #69 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.4408166Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #70 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.4408609Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #71 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.4409090Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #72 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.4409450Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #73 _start from ??:0 2025-12-04T13:44:25.4409783Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] #74 from ??:0 2025-12-04T13:44:25.4410086Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] 
2025-12-04T13:44:25.4410382Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4410757Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4411250Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_aot_eager 2025-12-04T13:44:25.4411655Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4412010Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4412470Z E1204 13:18:46.434000 54153 site-packages/torch/testing/_internal/common_distributed.py:935] exiting process 0 with exit code: 10 2025-12-04T13:44:25.4412743Z FAILED [13.7263s] [100%] 2025-12-04T13:44:25.4412813Z 2025-12-04T13:44:25.4412871Z =================================== FAILURES =================================== 2025-12-04T13:44:25.4413049Z ______________________ TestMultiProc.test_fsdp_aot_eager _______________________ 2025-12-04T13:44:25.4413213Z Traceback (most recent call last): 2025-12-04T13:44:25.4413462Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T13:44:25.4413708Z self._join_processes(fn) 2025-12-04T13:44:25.4413991Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T13:44:25.4414265Z self._check_return_codes(fn, elapsed_time) 2025-12-04T13:44:25.4414538Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1079, in 
_check_return_codes 2025-12-04T13:44:25.4414804Z raise RuntimeError(error) 2025-12-04T13:44:25.4414959Z RuntimeError: Process 0 exited with error code 10 and exception: 2025-12-04T13:44:25.4415121Z Traceback (most recent call last): 2025-12-04T13:44:25.4415363Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4415611Z getattr(self, test_name)() 2025-12-04T13:44:25.4415848Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4416088Z fn() 2025-12-04T13:44:25.4416298Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4416541Z method(*args, **kwargs) 2025-12-04T13:44:25.4416707Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4416883Z return func(*args, **kwds) 2025-12-04T13:44:25.4417365Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4417685Z return func(*args, **kwargs) 2025-12-04T13:44:25.4417953Z File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 681, in test_fsdp_aot_eager 2025-12-04T13:44:25.4418256Z with _dynamo_dist_per_rank_init(self.rank, self.world_size): 2025-12-04T13:44:25.4418502Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135, in __enter__ 2025-12-04T13:44:25.4418734Z return next(self.gen) 2025-12-04T13:44:25.4419030Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1597, in _dynamo_dist_per_rank_init 2025-12-04T13:44:25.4419382Z c10d.init_process_group(backend=backend, rank=rank, world_size=world_size) 2025-12-04T13:44:25.4419712Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4420004Z return func(*args, **kwargs) 2025-12-04T13:44:25.4420262Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper 2025-12-04T13:44:25.4420538Z func_return = func(*args, **kwargs) 2025-12-04T13:44:25.4420830Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in init_process_group 2025-12-04T13:44:25.4421157Z store, rank, world_size = next(rendezvous_iterator) 2025-12-04T13:44:25.4421460Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 281, in _env_rendezvous_handler 2025-12-04T13:44:25.4421734Z store = _create_c10d_store( 2025-12-04T13:44:25.4422025Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 200, in _create_c10d_store 2025-12-04T13:44:25.4422325Z return TCPStore( 2025-12-04T13:44:25.4422683Z torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. 
port: 6789, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use 2025-12-04T13:44:25.4423220Z Exception raised from makeWithPort at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:313 (most recent call first): 2025-12-04T13:44:25.4423560Z C++ CapturedTraceback: 2025-12-04T13:44:25.4424070Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4424671Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4425017Z #6 c10d::detail::UvTcpServer::makeWithPort(uv_loop_s*, unsigned short, bool) from :0 2025-12-04T13:44:25.4425342Z #7 c10d::detail::LibUVStoreDaemon::init(c10d::TCPStoreOptions const&) [clone .cold] from TCPStoreLibUvBackend.cpp:0 2025-12-04T13:44:25.4425659Z #8 c10d::detail::create_libuv_tcpstore_backend(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4425923Z #9 c10d::detail::TCPServer::start(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4426251Z #10 c10d::TCPStore::TCPStore(std::__cxx11::basic_string, std::allocator >, c10d::TCPStoreOptions const&) from ??:0 2025-12-04T13:44:25.4430180Z #11 pybind11::cpp_function::initialize, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#50}, pybind11::detail::void_type (*)(), c10::intrusive_ptr > (std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::detail::void_type ()>::execute > >, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v 
const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24]) &&::{lambda(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#1}, void, pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&&, void (*)(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 2025-12-04T13:44:25.4434049Z #12 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.4434341Z #13 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.4434620Z #14 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4434916Z #15 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4435240Z #16 
PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4435505Z #17 slot_tp_init from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7737 2025-12-04T13:44:25.4435770Z #18 type_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:1135 2025-12-04T13:44:25.4435994Z #19 pybind11_meta_call from :0 2025-12-04T13:44:25.4436206Z #20 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4436523Z #21 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4436839Z #22 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4437154Z #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4437543Z #24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4437857Z #25 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4438193Z #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4438505Z #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4438791Z #28 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4439065Z #29 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4439335Z #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4439617Z #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4439888Z #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4440199Z #33 _PyEval_EvalFrame 
from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4440524Z #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4440841Z #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4441156Z #36 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4441498Z #37 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4441813Z #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4442149Z #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4442494Z #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4442772Z #41 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4443074Z #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4443345Z #43 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4443611Z #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4443908Z #45 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4444175Z #46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4444491Z #47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4444820Z #48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4445130Z #49 
_PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4445460Z #50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4445770Z #51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4446074Z #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4446378Z #53 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4446632Z #54 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4446915Z #55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4447235Z #56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4447615Z #57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4447943Z #58 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4448254Z #59 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4448567Z #60 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4448899Z #61 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4449216Z #62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4449675Z #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4449988Z #64 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 
2025-12-04T13:44:25.4450261Z #65 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.4450592Z #66 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.4450855Z #67 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.4451190Z #68 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.4451475Z #69 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.4451721Z #70 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.4451993Z #71 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.4452225Z #72 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.4452445Z #73 _start from ??:0 2025-12-04T13:44:25.4452600Z #74 from ??:0 2025-12-04T13:44:25.4452716Z 2025-12-04T13:44:25.4452718Z 2025-12-04T13:44:25.4452807Z To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4453124Z PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_aot_eager 2025-12-04T13:44:25.4453327Z 2025-12-04T13:44:25.4453430Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4453581Z 2025-12-04T13:44:25.4453583Z 2025-12-04T13:44:25.4453669Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:44:25.4453935Z Process 0 terminated with exit code 10, terminating remaining processes. 
2025-12-04T13:44:25.4454351Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-0634d46fd188d3db.xml - 2025-12-04T13:44:25.4454763Z =========================== short test summary info ============================ 2025-12-04T13:44:25.4455107Z FAILED [13.7263s] distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager - RuntimeError: Process 0 exited with error code 10 and exception: 2025-12-04T13:44:25.4455420Z Traceback (most recent call last): 2025-12-04T13:44:25.4455739Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4456021Z getattr(self, test_name)() 2025-12-04T13:44:25.4456306Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4456584Z fn() 2025-12-04T13:44:25.4456822Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4457114Z method(*args, **kwargs) 2025-12-04T13:44:25.4457308Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4457573Z return func(*args, **kwds) 2025-12-04T13:44:25.4457869Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4458150Z return func(*args, **kwargs) 2025-12-04T13:44:25.4458422Z File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 681, in test_fsdp_aot_eager 2025-12-04T13:44:25.4458726Z with _dynamo_dist_per_rank_init(self.rank, self.world_size): 2025-12-04T13:44:25.4458970Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135, in __enter__ 2025-12-04T13:44:25.4459196Z return next(self.gen) 2025-12-04T13:44:25.4459489Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1597, in _dynamo_dist_per_rank_init 2025-12-04T13:44:25.4459842Z c10d.init_process_group(backend=backend, rank=rank, world_size=world_size) 2025-12-04T13:44:25.4460171Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4460436Z return func(*args, **kwargs) 2025-12-04T13:44:25.4460726Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper 2025-12-04T13:44:25.4460998Z func_return = func(*args, **kwargs) 2025-12-04T13:44:25.4461285Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in init_process_group 2025-12-04T13:44:25.4461609Z store, rank, world_size = next(rendezvous_iterator) 2025-12-04T13:44:25.4461921Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 281, in _env_rendezvous_handler 2025-12-04T13:44:25.4462200Z store = _create_c10d_store( 2025-12-04T13:44:25.4462491Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 200, in _create_c10d_store 2025-12-04T13:44:25.4462795Z return TCPStore( 2025-12-04T13:44:25.4463158Z torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. 
port: 6789, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use 2025-12-04T13:44:25.4463703Z Exception raised from makeWithPort at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:313 (most recent call first): 2025-12-04T13:44:25.4464030Z C++ CapturedTraceback: 2025-12-04T13:44:25.4464536Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4465129Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4465443Z #6 c10d::detail::UvTcpServer::makeWithPort(uv_loop_s*, unsigned short, bool) from :0 2025-12-04T13:44:25.4465798Z #7 c10d::detail::LibUVStoreDaemon::init(c10d::TCPStoreOptions const&) [clone .cold] from TCPStoreLibUvBackend.cpp:0 2025-12-04T13:44:25.4466111Z #8 c10d::detail::create_libuv_tcpstore_backend(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4466378Z #9 c10d::detail::TCPServer::start(c10d::TCPStoreOptions const&) from :0 2025-12-04T13:44:25.4466704Z #10 c10d::TCPStore::TCPStore(std::__cxx11::basic_string, std::allocator >, c10d::TCPStoreOptions const&) from ??:0 2025-12-04T13:44:25.4470707Z #11 pybind11::cpp_function::initialize, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#50}, pybind11::detail::void_type (*)(), c10::intrusive_ptr > (std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::detail::void_type ()>::execute > >, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v 
const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24]) &&::{lambda(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool)#1}, void, pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v, char [24]>(pybind11::class_ > >&&, void (*)(pybind11::detail::value_and_holder&, std::__cxx11::basic_string, std::allocator > const&, unsigned short, std::optional, bool, std::chrono::duration >, bool, bool, std::optional, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&, char const (&) [24])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 2025-12-04T13:44:25.4587835Z #12 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.4588148Z #13 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.4588444Z #14 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4588761Z #15 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4589068Z #16 
PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4589324Z #17 slot_tp_init from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7737 2025-12-04T13:44:25.4589569Z #18 type_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:1135 2025-12-04T13:44:25.4589784Z #19 pybind11_meta_call from :0 2025-12-04T13:44:25.4589974Z #20 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.4590253Z #21 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.4590554Z #22 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4590860Z #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4591154Z #24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4591452Z #25 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4591752Z #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4592047Z #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4592309Z #28 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4592542Z #29 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4592797Z #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4593065Z #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4593301Z #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4593885Z #33 _PyEval_EvalFrame 
from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4594186Z #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4594482Z #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4594775Z #36 cfunction_vectorcall_FASTCALL from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:430 2025-12-04T13:44:25.4595076Z #37 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4595371Z #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4595667Z #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4596052Z #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4596306Z #41 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4596558Z #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4596815Z #43 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4597068Z #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4597327Z #45 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4597706Z #46 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4598012Z #47 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4598310Z #48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4598603Z #49 
_PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4598908Z #50 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4599204Z #51 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4599496Z #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4599767Z #53 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.4599998Z #54 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.4600257Z #55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4600552Z #56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4637935Z #57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4638252Z #58 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4638603Z #59 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4638942Z #60 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4639256Z #61 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4639559Z #62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.4639871Z #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.4640143Z #64 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 
2025-12-04T13:44:25.4640393Z #65 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.4717976Z #66 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.4718377Z #67 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.4718648Z #68 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.4739481Z #69 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.4739707Z #70 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.4739934Z #71 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.4740135Z #72 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.4740287Z #73 _start from ??:0 2025-12-04T13:44:25.4740522Z #74 from ??:0 2025-12-04T13:44:25.4740611Z 2025-12-04T13:44:25.4740614Z 2025-12-04T13:44:25.4740696Z To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4740963Z PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_aot_eager 2025-12-04T13:44:25.4741146Z 2025-12-04T13:44:25.4741242Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4741442Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 
2025-12-04T13:44:25.4741620Z ====================== 1 failed, 61 deselected in 13.74s ======================= 2025-12-04T13:44:25.4741773Z Got exit code 1 2025-12-04T13:44:25.4741999Z FAILED CONSISTENTLY: test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager 2025-12-04T13:44:25.4742300Z Test failed consistently, continuing with the rest of the tests due to continue-through-error being set 2025-12-04T13:44:25.4742687Z Test results will be stored in test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-bce0960dce443bc0.xml 2025-12-04T13:44:25.4742993Z ============================= test session starts ============================== 2025-12-04T13:44:25.4743225Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T13:44:25.4743427Z cachedir: .pytest_cache 2025-12-04T13:44:25.4743663Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T13:44:25.4743916Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T13:44:25.4744045Z configfile: pytest.ini 2025-12-04T13:44:25.4744282Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T13:44:25.4744567Z collecting ... collected 62 items / 25 deselected / 37 selected 2025-12-04T13:44:25.4744742Z stepcurrent: skipping 25 already run items. 
2025-12-04T13:44:25.4744889Z Running 37 items in this shard 2025-12-04T13:44:25.4744970Z 2025-12-04T13:44:25.4745239Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor I1204 13:18:53.143000 56796 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 57452 2025-12-04T13:44:25.4745696Z I1204 13:18:53.144000 56796 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 57453 2025-12-04T13:44:25.4746046Z I1204 13:18:53.144000 56796 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 57454 2025-12-04T13:44:25.4746385Z I1204 13:18:53.145000 56796 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 57455 2025-12-04T13:44:25.4746791Z [rank3]:W1204 13:19:01.289000 57455 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4747248Z [rank2]:W1204 13:19:03.389000 57454 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4747701Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] Caught exception: 2025-12-04T13:44:25.4748029Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] Traceback (most recent call last): 2025-12-04T13:44:25.4748512Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4748987Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] getattr(self, test_name)() 2025-12-04T13:44:25.4749451Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4749921Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] fn() 2025-12-04T13:44:25.4750356Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4750809Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] method(*args, **kwargs) 2025-12-04T13:44:25.4751198Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4751592Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwds) 2025-12-04T13:44:25.4752058Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4752528Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.4752987Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 803, in test_fsdp_inductor 2025-12-04T13:44:25.4753480Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] with _dynamo_dist_per_rank_init(self.rank, self.world_size): 2025-12-04T13:44:25.4753926Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135, 
in __enter__ 2025-12-04T13:44:25.4754327Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] return next(self.gen) 2025-12-04T13:44:25.4754818Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1597, in _dynamo_dist_per_rank_init 2025-12-04T13:44:25.4755404Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] c10d.init_process_group(backend=backend, rank=rank, world_size=world_size) 2025-12-04T13:44:25.4755930Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4756386Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.4756839Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper 2025-12-04T13:44:25.4757328Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] func_return = func(*args, **kwargs) 2025-12-04T13:44:25.4757852Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in init_process_group 2025-12-04T13:44:25.4758354Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] store, rank, world_size = next(rendezvous_iterator) 2025-12-04T13:44:25.4758852Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 281, in _env_rendezvous_handler 2025-12-04T13:44:25.4759377Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] store = _create_c10d_store( 2025-12-04T13:44:25.4759838Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 200, in _create_c10d_store 2025-12-04T13:44:25.4760285Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] return TCPStore( 2025-12-04T13:44:25.4760823Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 6789, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use 2025-12-04T13:44:25.4761333Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4761675Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4762164Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_inductor 2025-12-04T13:44:25.4762569Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4762925Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4763328Z E1204 13:19:04.194000 57452 site-packages/torch/testing/_internal/common_distributed.py:935] exiting process 0 with exit code: 10 
2025-12-04T13:44:25.4763578Z FAILED [11.9230s] [ 2%] 2025-12-04T13:44:25.4763656Z 2025-12-04T13:44:25.4763720Z =================================== FAILURES =================================== 2025-12-04T13:44:25.4763908Z _______________________ TestMultiProc.test_fsdp_inductor _______________________ 2025-12-04T13:44:25.4764083Z Traceback (most recent call last): 2025-12-04T13:44:25.4764345Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T13:44:25.4764601Z self._join_processes(fn) 2025-12-04T13:44:25.4764859Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T13:44:25.4765134Z self._check_return_codes(fn, elapsed_time) 2025-12-04T13:44:25.4765324Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1079, in _check_return_codes 2025-12-04T13:44:25.4765372Z raise RuntimeError(error) 2025-12-04T13:44:25.4765462Z RuntimeError: Process 0 exited with error code 10 and exception: 2025-12-04T13:44:25.4765514Z Traceback (most recent call last): 2025-12-04T13:44:25.4765685Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4765765Z getattr(self, test_name)() 2025-12-04T13:44:25.4765934Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4765975Z fn() 2025-12-04T13:44:25.4766136Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4766182Z method(*args, **kwargs) 2025-12-04T13:44:25.4766280Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4766326Z return func(*args, **kwds) 
2025-12-04T13:44:25.4766493Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4766564Z return func(*args, **kwargs) 2025-12-04T13:44:25.4766717Z File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 803, in test_fsdp_inductor 2025-12-04T13:44:25.4766801Z with _dynamo_dist_per_rank_init(self.rank, self.world_size): 2025-12-04T13:44:25.4766902Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135, in __enter__ 2025-12-04T13:44:25.4766954Z return next(self.gen) 2025-12-04T13:44:25.4767145Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1597, in _dynamo_dist_per_rank_init 2025-12-04T13:44:25.4767244Z c10d.init_process_group(backend=backend, rank=rank, world_size=world_size) 2025-12-04T13:44:25.4767391Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4767444Z return func(*args, **kwargs) 2025-12-04T13:44:25.4767624Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper 2025-12-04T13:44:25.4767684Z func_return = func(*args, **kwargs) 2025-12-04T13:44:25.4767860Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in init_process_group 2025-12-04T13:44:25.4767949Z store, rank, world_size = next(rendezvous_iterator) 2025-12-04T13:44:25.4768117Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 281, in _env_rendezvous_handler 2025-12-04T13:44:25.4768167Z store = _create_c10d_store( 2025-12-04T13:44:25.4768328Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 200, in _create_c10d_store 2025-12-04T13:44:25.4768373Z return TCPStore( 2025-12-04T13:44:25.4768613Z 
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 6789, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use 2025-12-04T13:44:25.4768618Z 2025-12-04T13:44:25.4768700Z To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4768842Z PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_inductor 2025-12-04T13:44:25.4768845Z 2025-12-04T13:44:25.4768934Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4768937Z 2025-12-04T13:44:25.4768938Z 2025-12-04T13:44:25.4769021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:44:25.4769109Z Process 0 terminated with exit code 10, terminating remaining processes. 2025-12-04T13:44:25.4769363Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-bce0960dce443bc0.xml - 2025-12-04T13:44:25.4769426Z =========================== short test summary info ============================ 2025-12-04T13:44:25.4769636Z FAILED [11.9230s] distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor - RuntimeError: Process 0 exited with error code 10 and exception: 2025-12-04T13:44:25.4769685Z Traceback (most recent call last): 2025-12-04T13:44:25.4769880Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4769926Z getattr(self, test_name)() 2025-12-04T13:44:25.4770087Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4770123Z fn() 2025-12-04T13:44:25.4770277Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 
2025-12-04T13:44:25.4770319Z method(*args, **kwargs) 2025-12-04T13:44:25.4770411Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4770481Z return func(*args, **kwds) 2025-12-04T13:44:25.4770643Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4770690Z return func(*args, **kwargs) 2025-12-04T13:44:25.4770836Z File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 803, in test_fsdp_inductor 2025-12-04T13:44:25.4770909Z with _dynamo_dist_per_rank_init(self.rank, self.world_size): 2025-12-04T13:44:25.4771004Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 135, in __enter__ 2025-12-04T13:44:25.4771048Z return next(self.gen) 2025-12-04T13:44:25.4771234Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1597, in _dynamo_dist_per_rank_init 2025-12-04T13:44:25.4771325Z c10d.init_process_group(backend=backend, rank=rank, world_size=world_size) 2025-12-04T13:44:25.4771464Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4771513Z return func(*args, **kwargs) 2025-12-04T13:44:25.4771651Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper 2025-12-04T13:44:25.4771703Z func_return = func(*args, **kwargs) 2025-12-04T13:44:25.4771872Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1803, in init_process_group 2025-12-04T13:44:25.4771935Z store, rank, world_size = next(rendezvous_iterator) 2025-12-04T13:44:25.4772101Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 281, in _env_rendezvous_handler 2025-12-04T13:44:25.4772145Z store = _create_c10d_store( 
2025-12-04T13:44:25.4772303Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 200, in _create_c10d_store 2025-12-04T13:44:25.4772345Z return TCPStore( 2025-12-04T13:44:25.4772587Z torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 6789, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use 2025-12-04T13:44:25.4772593Z 2025-12-04T13:44:25.4772669Z To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4772810Z PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_inductor 2025-12-04T13:44:25.4772813Z 2025-12-04T13:44:25.4772899Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4772967Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T13:44:25.4773032Z ====================== 1 failed, 25 deselected in 11.93s ======================= 2025-12-04T13:44:25.4773076Z Got exit code 1 2025-12-04T13:44:25.4773116Z Retrying single test... 
2025-12-04T13:44:25.4773319Z Test results will be stored in test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-002c6ad5ba4d16ec.xml 2025-12-04T13:44:25.4773381Z ============================= test session starts ============================== 2025-12-04T13:44:25.4773520Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T13:44:25.4773563Z cachedir: .pytest_cache 2025-12-04T13:44:25.4773725Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T13:44:25.4773776Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T13:44:25.4773826Z configfile: pytest.ini 2025-12-04T13:44:25.4773993Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T13:44:25.4774077Z collecting ... collected 62 items / 61 deselected / 1 selected 2025-12-04T13:44:25.4774256Z stepcurrent: skipping 25 already run items. 
Running only test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor 2025-12-04T13:44:25.4774352Z Running 1 items in this shard 2025-12-04T13:44:25.4774354Z 2025-12-04T13:44:25.4774612Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor I1204 13:19:12.724000 60091 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 60747 2025-12-04T13:44:25.4774772Z I1204 13:19:12.725000 60091 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 60748 2025-12-04T13:44:25.4774932Z I1204 13:19:12.726000 60091 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 60749 2025-12-04T13:44:25.4775085Z I1204 13:19:12.727000 60091 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 60750 2025-12-04T13:44:25.4775301Z [rank2]:W1204 13:19:21.758000 60749 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.4775451Z [W1204 13:19:21.892851898 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T13:44:25.4775616Z [W1204 13:19:21.892874457 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4775618Z 2025-12-04T13:44:25.4775753Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] Caught exception: 2025-12-04T13:44:25.4775915Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] Traceback (most recent call last): 2025-12-04T13:44:25.4776198Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4776352Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] getattr(self, test_name)() 2025-12-04T13:44:25.4776638Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4776757Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] fn() 2025-12-04T13:44:25.4777032Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4777175Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] method(*args, **kwargs) 2025-12-04T13:44:25.4777387Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4787589Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwds) 2025-12-04T13:44:25.4787900Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4788057Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.4788322Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 686, in test_fsdp_aot_eager 2025-12-04T13:44:25.4788474Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] outputs = fsdp_m(inputs) 2025-12-04T13:44:25.4788727Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441, in __call__ 2025-12-04T13:44:25.4788935Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return super().__call__(*args, **kwargs) 2025-12-04T13:44:25.4789207Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl 2025-12-04T13:44:25.4789375Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return self._call_impl(*args, **kwargs) 2025-12-04T13:44:25.4789640Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl 2025-12-04T13:44:25.4789800Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return forward_call(*args, **kwargs) 2025-12-04T13:44:25.4790072Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926, in compile_wrapper 2025-12-04T13:44:25.4790221Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return fn(*args, **kwargs) 2025-12-04T13:44:25.4790483Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 2194, in __call__ 2025-12-04T13:44:25.4790647Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] result = self._torchdynamo_orig_backend( 2025-12-04T13:44:25.4790908Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1937, in __call__ 2025-12-04T13:44:25.4791062Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] result = self._inner_convert( 2025-12-04T13:44:25.4791322Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 706, in __call__ 2025-12-04T13:44:25.4791465Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] result = _compile( 2025-12-04T13:44:25.4791722Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1807, in _compile 2025-12-04T13:44:25.4791887Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] raise InternalTorchDynamoError( 2025-12-04T13:44:25.4792163Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1744, in _compile 2025-12-04T13:44:25.4792371Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] guarded_code, tracer_output = compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.4792631Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_utils_internal.py", line 97, in wrapper_function 2025-12-04T13:44:25.4792793Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return function(*args, **kwargs) 2025-12-04T13:44:25.4793061Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1425, in compile_inner 2025-12-04T13:44:25.4793251Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return _compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.4793524Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1459, in _compile_inner 2025-12-04T13:44:25.4793677Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] dynamo_output = compile_frame( 2025-12-04T13:44:25.4793945Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1341, in compile_frame 2025-12-04T13:44:25.4794141Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] bytecode, tracer_output = transform_code_object(code, transform) 2025-12-04T13:44:25.4794443Z E1204 13:19:26.873000 47549 
site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1600, in transform_code_object 2025-12-04T13:44:25.4794638Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] tracer_output = transformations(instructions, code_options) 2025-12-04T13:44:25.4794895Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1313, in transform 2025-12-04T13:44:25.4795047Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] tracer_output = trace_frame( 2025-12-04T13:44:25.4795292Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 328, in _fn 2025-12-04T13:44:25.4795442Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return fn(*args, **kwargs) 2025-12-04T13:44:25.4795699Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 837, in trace_frame 2025-12-04T13:44:25.4795826Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] run_tracer() 2025-12-04T13:44:25.4796082Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 818, in run_tracer 2025-12-04T13:44:25.4796210Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] tracer.run() 2025-12-04T13:44:25.4796465Z E1204 13:19:26.873000 47549 
site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1639, in run 2025-12-04T13:44:25.4796624Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] while self.step(): 2025-12-04T13:44:25.4796880Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1319, in step 2025-12-04T13:44:25.4797048Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] self.dispatch_table[inst.opcode](self, inst) 2025-12-04T13:44:25.4797304Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 856, in wrapper 2025-12-04T13:44:25.4797552Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return handle_graph_break(self, inst, speculation.reason) 2025-12-04T13:44:25.4797829Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 918, in handle_graph_break 2025-12-04T13:44:25.4798018Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] all_stack_locals_metadata = self.output.compile_subgraph( 2025-12-04T13:44:25.4798284Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1712, in compile_subgraph 2025-12-04T13:44:25.4798438Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] self.run_compiler_collective() 2025-12-04T13:44:25.4798716Z 
E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 2069, in run_compiler_collective 2025-12-04T13:44:25.4798917Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] dist.all_gather_object(all_states, ds.local_state, group=compile_pg) 2025-12-04T13:44:25.4799175Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4799325Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.4799610Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3248, in all_gather_object 2025-12-04T13:44:25.4799790Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] all_gather(object_size_list, local_size, group=group) 2025-12-04T13:44:25.4800048Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4800193Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.4800470Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4015, in all_gather 2025-12-04T13:44:25.4800646Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] work = 
group.allgather([tensor_list], [tensor], opts) 2025-12-04T13:44:25.4801118Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] torch._dynamo.exc.InternalTorchDynamoError: DistBackendError: NCCL error in: /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/NCCLUtils.cpp:94, remote process exited or there was a network error, NCCL version 2.27.7 2025-12-04T13:44:25.4801365Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 2025-12-04T13:44:25.4801486Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] Last error: 2025-12-04T13:44:25.4801754Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] socketPollConnect: connect to 10.42.137.215<41547> returned Connection refused, exceeded error retry count after 35 attempts 2025-12-04T13:44:25.4801895Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4802022Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] from user code: 2025-12-04T13:44:25.4802276Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 69, in inner 2025-12-04T13:44:25.4802424Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] return fn(*args, **kwargs) 2025-12-04T13:44:25.4802533Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4802863Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to 
PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" 2025-12-04T13:44:25.4802974Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4803078Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4803268Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4803522Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_aot_eager 2025-12-04T13:44:25.4803629Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.4803832Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4803993Z E1204 13:19:26.873000 47549 site-packages/torch/testing/_internal/common_distributed.py:935] exiting process 0 with exit code: 10 2025-12-04T13:44:25.4804034Z Process process 0: 2025-12-04T13:44:25.4804085Z Traceback (most recent call last): 2025-12-04T13:44:25.4804250Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.4804301Z getattr(self, test_name)() 2025-12-04T13:44:25.4804464Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.4804499Z fn() 2025-12-04T13:44:25.4804654Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.4804696Z method(*args, **kwargs) 2025-12-04T13:44:25.4804789Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.4804836Z return func(*args, **kwds) 2025-12-04T13:44:25.4804997Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.4805069Z return func(*args, **kwargs) 2025-12-04T13:44:25.4805219Z File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 686, in test_fsdp_aot_eager 2025-12-04T13:44:25.4805264Z outputs = fsdp_m(inputs) 2025-12-04T13:44:25.4805400Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441, in __call__ 2025-12-04T13:44:25.4805450Z return super().__call__(*args, **kwargs) 2025-12-04T13:44:25.4805604Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl 2025-12-04T13:44:25.4805655Z return self._call_impl(*args, **kwargs) 2025-12-04T13:44:25.4805797Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl 2025-12-04T13:44:25.4805867Z return forward_call(*args, **kwargs) 2025-12-04T13:44:25.4806012Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926, in compile_wrapper 2025-12-04T13:44:25.4806057Z return fn(*args, **kwargs) 2025-12-04T13:44:25.4806197Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 2194, in __call__ 2025-12-04T13:44:25.4806248Z result = self._torchdynamo_orig_backend( 2025-12-04T13:44:25.4806386Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1937, in __call__ 2025-12-04T13:44:25.4806431Z result = self._inner_convert( 2025-12-04T13:44:25.4806569Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 706, in __call__ 2025-12-04T13:44:25.4806610Z result = _compile( 
2025-12-04T13:44:25.4806750Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1807, in _compile 2025-12-04T13:44:25.4806804Z raise InternalTorchDynamoError( 2025-12-04T13:44:25.4806945Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1744, in _compile 2025-12-04T13:44:25.4807032Z guarded_code, tracer_output = compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.4807169Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_utils_internal.py", line 97, in wrapper_function 2025-12-04T13:44:25.4807221Z return function(*args, **kwargs) 2025-12-04T13:44:25.4807367Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1425, in compile_inner 2025-12-04T13:44:25.4807427Z return _compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.4807617Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1459, in _compile_inner 2025-12-04T13:44:25.4807667Z dynamo_output = compile_frame( 2025-12-04T13:44:25.4807812Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1341, in compile_frame 2025-12-04T13:44:25.4807896Z bytecode, tracer_output = transform_code_object(code, transform) 2025-12-04T13:44:25.4808070Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1600, in transform_code_object 2025-12-04T13:44:25.4808148Z tracer_output = transformations(instructions, code_options) 2025-12-04T13:44:25.4808287Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1313, in transform 2025-12-04T13:44:25.4808334Z tracer_output = trace_frame( 2025-12-04T13:44:25.4808462Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 328, in _fn 2025-12-04T13:44:25.4808508Z return fn(*args, **kwargs) 
2025-12-04T13:44:25.4808646Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 837, in trace_frame 2025-12-04T13:44:25.4808689Z run_tracer() 2025-12-04T13:44:25.4808827Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 818, in run_tracer 2025-12-04T13:44:25.4808894Z tracer.run() 2025-12-04T13:44:25.4809030Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1639, in run 2025-12-04T13:44:25.4809075Z while self.step(): 2025-12-04T13:44:25.4809211Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1319, in step 2025-12-04T13:44:25.4809272Z self.dispatch_table[inst.opcode](self, inst) 2025-12-04T13:44:25.4809415Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 856, in wrapper 2025-12-04T13:44:25.4809487Z return handle_graph_break(self, inst, speculation.reason) 2025-12-04T13:44:25.4809646Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 918, in handle_graph_break 2025-12-04T13:44:25.4809747Z all_stack_locals_metadata = self.output.compile_subgraph( 2025-12-04T13:44:25.4809903Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1712, in compile_subgraph 2025-12-04T13:44:25.4809949Z self.run_compiler_collective() 2025-12-04T13:44:25.4810108Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 2069, in run_compiler_collective 2025-12-04T13:44:25.4810192Z dist.all_gather_object(all_states, ds.local_state, group=compile_pg) 2025-12-04T13:44:25.4810337Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4810379Z return func(*args, **kwargs) 2025-12-04T13:44:25.4810548Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3248, in all_gather_object 2025-12-04T13:44:25.4810615Z all_gather(object_size_list, local_size, group=group) 2025-12-04T13:44:25.4810759Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.4810802Z return func(*args, **kwargs) 2025-12-04T13:44:25.4810960Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4015, in all_gather 2025-12-04T13:44:25.4811025Z work = group.allgather([tensor_list], [tensor], opts) 2025-12-04T13:44:25.4811347Z torch._dynamo.exc.InternalTorchDynamoError: DistBackendError: NCCL error in: /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/NCCLUtils.cpp:94, remote process exited or there was a network error, NCCL version 2.27.7 2025-12-04T13:44:25.4811475Z ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. 2025-12-04T13:44:25.4811512Z Last error: 2025-12-04T13:44:25.4811665Z socketPollConnect: connect to 10.42.137.215<41547> returned Connection refused, exceeded error retry count after 35 attempts 2025-12-04T13:44:25.4811669Z 2025-12-04T13:44:25.4811708Z from user code: 2025-12-04T13:44:25.4811852Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 69, in inner 2025-12-04T13:44:25.4811898Z return fn(*args, **kwargs) 2025-12-04T13:44:25.4811900Z 2025-12-04T13:44:25.4812119Z Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). 
For even more developer context, set TORCH_LOGS="+dynamo" 2025-12-04T13:44:25.4812121Z 2025-12-04T13:44:25.4812123Z 2025-12-04T13:44:25.4812200Z To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.4812347Z PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_aot_eager 2025-12-04T13:44:25.4812349Z 2025-12-04T13:44:25.4812442Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.4812445Z 2025-12-04T13:44:25.4812536Z During handling of the above exception, another exception occurred: 2025-12-04T13:44:25.4812538Z 2025-12-04T13:44:25.4812595Z Traceback (most recent call last): 2025-12-04T13:44:25.4812739Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap 2025-12-04T13:44:25.4812784Z self.run() 2025-12-04T13:44:25.4812893Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108, in run 2025-12-04T13:44:25.4812952Z self._target(*self._args, **self._kwargs) 2025-12-04T13:44:25.4813112Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1669, in _run 2025-12-04T13:44:25.4813171Z self.run_test(test_name, parent_pipe) 2025-12-04T13:44:25.4813331Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 942, in run_test 2025-12-04T13:44:25.4813411Z parent_pipe.send(traceback.format_exc()) 2025-12-04T13:44:25.4813525Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 206, in send 2025-12-04T13:44:25.4813588Z self._send_bytes(_ForkingPickler.dumps(obj)) 2025-12-04T13:44:25.4813713Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 411, in _send_bytes 2025-12-04T13:44:25.4813761Z self._send(header + buf) 2025-12-04T13:44:25.4813874Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", 
line 368, in _send 2025-12-04T13:44:25.4813926Z n = write(self._handle, buf) 2025-12-04T13:44:25.4813980Z BrokenPipeError: [Errno 32] Broken pipe 2025-12-04T13:44:25.4814158Z [rank3]:[W1204 13:19:27.459027174 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.4814160Z 2025-12-04T13:44:25.4814447Z [rank3]:[W1204 13:19:28.949073114 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4814635Z Exception raised from recvBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first): 2025-12-04T13:44:25.4814688Z C++ CapturedTraceback: 2025-12-04T13:44:25.4815068Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4815224Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4815378Z #6 void c10d::tcputil::recvBytes(int, c10d::detail::CheckResponseType*, unsigned long) from :0 2025-12-04T13:44:25.4815643Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4815726Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4815803Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4815862Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4815948Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4815950Z 2025-12-04T13:44:25.4816300Z [rank3]:[W1204 13:19:28.949291479 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on 
TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4816602Z [rank1]:[W1204 13:19:28.961168233 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4816780Z Exception raised from recvBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first): 2025-12-04T13:44:25.4817047Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4817220Z frame #1: + 0x6eb755f (0x7e108835655f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4847667Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1df (0x7e108835210f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4847923Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4848043Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4848146Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4848255Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4848257Z 2025-12-04T13:44:25.4848606Z [rank1]:[W1204 13:19:28.963480741 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to 
recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4848783Z [rank1]:[W1204 13:19:28.259436478 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.4848786Z 2025-12-04T13:44:25.4849066Z [rank2]:[W1204 13:19:28.309735479 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4849248Z Exception raised from recvBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first): 2025-12-04T13:44:25.4849514Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4849685Z frame #1: + 0x6eb755f (0x78d47cd5655f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4850061Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1df (0x78d47cd5210f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4850268Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4850386Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4850488Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4850596Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4850600Z 2025-12-04T13:44:25.4850978Z [rank2]:[W1204 13:19:28.310956631 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 
2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4851141Z [rank2]:[W1204 13:19:28.311439600 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.4851143Z 2025-12-04T13:44:25.4851423Z [rank3]:[W1204 13:19:28.309781358 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4851596Z Exception raised from recvBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first): 2025-12-04T13:44:25.4851879Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4852045Z frame #1: + 0x6eb755f (0x719d5615655f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4852412Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1df (0x719d5615210f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4852619Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4852727Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4852827Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4852927Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.4852928Z 2025-12-04T13:44:25.4853275Z [rank3]:[W1204 13:19:28.311587257 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4853561Z [rank1]:[W1204 13:19:28.586638143 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4853743Z Exception raised from recvBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first): 2025-12-04T13:44:25.4853794Z C++ CapturedTraceback: 2025-12-04T13:44:25.4854170Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4854322Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4854479Z #6 void c10d::tcputil::recvBytes(int, c10d::detail::CheckResponseType*, unsigned long) from :0 2025-12-04T13:44:25.4854737Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4854823Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4854963Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4855022Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4855100Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4855102Z 2025-12-04T13:44:25.4855445Z [rank1]:[W1204 13:19:28.586789270 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" 
flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4855609Z [rank3]:[W1204 13:19:29.949456649 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.4855636Z 2025-12-04T13:44:25.4855810Z [rank3]:[W1204 13:19:29.955230240 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4855986Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4856035Z C++ CapturedTraceback: 2025-12-04T13:44:25.4856409Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4856554Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4856676Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4856930Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4857011Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4857077Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4857136Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4857210Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4857212Z 2025-12-04T13:44:25.4857451Z [rank3]:[W1204 13:19:29.955376486 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), 
with error: Broken pipe 2025-12-04T13:44:25.4857660Z [rank1]:[W1204 13:19:29.963703060 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4857839Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4858101Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4858268Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4858639Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4858843Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4858984Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4859082Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4859190Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4859194Z 2025-12-04T13:44:25.4859433Z [rank1]:[W1204 13:19:29.965663716 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4859603Z [rank2]:[W1204 13:19:29.311098632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4859811Z Exception 
raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4860072Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4860240Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4860609Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4860818Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4860931Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4861030Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4861130Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4861132Z 2025-12-04T13:44:25.4861369Z [rank2]:[W1204 13:19:29.312325235 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4861545Z [rank3]:[W1204 13:19:29.311713698 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4861723Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4861982Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4862150Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4862516Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4862725Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4862834Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4862960Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4863060Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4863062Z 2025-12-04T13:44:25.4863300Z [rank3]:[W1204 13:19:29.313039189 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4863466Z [rank1]:[W1204 13:19:29.586937581 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4863472Z 2025-12-04T13:44:25.4863646Z [rank1]:[W1204 13:19:29.591426280 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4863857Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4863901Z C++ CapturedTraceback: 2025-12-04T13:44:25.4864281Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4864428Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4864555Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4864809Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4864894Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4864970Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4865029Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4865111Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4865113Z 2025-12-04T13:44:25.4865349Z [rank1]:[W1204 13:19:29.591950688 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4865635Z [rank2]:[W1204 13:19:29.621875477 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 
2025-12-04T13:44:25.4865815Z Exception raised from recvBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first): 2025-12-04T13:44:25.4865864Z C++ CapturedTraceback: 2025-12-04T13:44:25.4866234Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4866385Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4866544Z #6 void c10d::tcputil::recvBytes(int, c10d::detail::CheckResponseType*, unsigned long) from :0 2025-12-04T13:44:25.4866817Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4866902Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4866971Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4867036Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4867113Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4867115Z 2025-12-04T13:44:25.4867460Z [rank2]:[W1204 13:19:29.622109112 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash? 2025-12-04T13:44:25.4867674Z [rank3]:[W1204 13:19:30.955523978 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4867676Z 2025-12-04T13:44:25.4867852Z [rank3]:[W1204 13:19:30.955616406 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4868030Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4868072Z C++ CapturedTraceback: 2025-12-04T13:44:25.4868443Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4868585Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4868716Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4868968Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4869051Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4869121Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4869185Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4869263Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4869265Z 2025-12-04T13:44:25.4869507Z [rank3]:[W1204 13:19:30.955723004 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4869686Z [rank1]:[W1204 13:19:30.965845396 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4869864Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.4870128Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4870292Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4870666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4870908Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4871016Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4871119Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4871218Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4871220Z 2025-12-04T13:44:25.4871458Z [rank1]:[W1204 13:19:30.968216543 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4871656Z [rank2]:[W1204 13:19:30.312476907 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4871841Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4872100Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.4872272Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4872648Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4872853Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4872965Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4873063Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4873167Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4873169Z 2025-12-04T13:44:25.4873403Z [rank2]:[W1204 13:19:30.314425113 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4873580Z [rank3]:[W1204 13:19:30.313166011 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4873758Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4874016Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4874182Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4874549Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4874778Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4874887Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4874991Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4875091Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4875098Z 2025-12-04T13:44:25.4875334Z [rank3]:[W1204 13:19:30.315572907 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4875509Z [rank1]:[W1204 13:19:30.592062572 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4875534Z 2025-12-04T13:44:25.4875709Z [rank1]:[W1204 13:19:30.592138300 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4875891Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4875938Z C++ CapturedTraceback: 2025-12-04T13:44:25.4876311Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4876456Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4876582Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4876834Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4876912Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4876981Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4877037Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4877113Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4877115Z 2025-12-04T13:44:25.4877349Z [rank1]:[W1204 13:19:30.592216258 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4877556Z [rank2]:[W1204 13:19:30.622274144 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4877558Z 2025-12-04T13:44:25.4877744Z [rank2]:[W1204 13:19:30.628165271 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4877920Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4877964Z C++ CapturedTraceback: 2025-12-04T13:44:25.4878352Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4878506Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4878658Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4878919Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4878994Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4879063Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4879119Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4879199Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4879202Z 2025-12-04T13:44:25.4879435Z [rank2]:[W1204 13:19:30.628353397 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4879640Z [rank3]:[W1204 13:19:31.955870466 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4879642Z 2025-12-04T13:44:25.4879816Z [rank3]:[W1204 13:19:31.955976684 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4879988Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4880031Z C++ CapturedTraceback: 2025-12-04T13:44:25.4880397Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4880549Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4880669Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4880924Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4881006Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4881075Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4881139Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4881372Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4881376Z 2025-12-04T13:44:25.4881621Z [rank3]:[W1204 13:19:31.956072442 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4881792Z [rank1]:[W1204 13:19:31.968369006 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4881976Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.4882236Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4882406Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4882801Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4883004Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4883116Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4883217Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4883319Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4883321Z 2025-12-04T13:44:25.4883573Z [rank1]:[W1204 13:19:31.970837410 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4883749Z [rank2]:[W1204 13:19:31.314509607 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4883926Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4884188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.4884360Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4884735Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4884945Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4885051Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4885152Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4885251Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4885253Z 2025-12-04T13:44:25.4885493Z [rank2]:[W1204 13:19:31.316526192 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4885670Z [rank3]:[W1204 13:19:31.315691531 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4885846Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4886106Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4886270Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4886639Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4886863Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4886971Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4887069Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4887167Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4887169Z 2025-12-04T13:44:25.4887409Z [rank3]:[W1204 13:19:31.317742735 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4887661Z [rank1]:[W1204 13:19:31.592344262 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4887664Z 2025-12-04T13:44:25.4887845Z [rank1]:[W1204 13:19:31.592428440 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4888021Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4888070Z C++ CapturedTraceback: 2025-12-04T13:44:25.4888439Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4888589Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4888712Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4888962Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4889045Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4889115Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4889180Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4889258Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4889260Z 2025-12-04T13:44:25.4889502Z [rank1]:[W1204 13:19:31.592838061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4889670Z [rank2]:[W1204 13:19:31.628489761 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4889679Z 2025-12-04T13:44:25.4889851Z [rank2]:[W1204 13:19:31.628594388 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4890033Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4890074Z C++ CapturedTraceback: 2025-12-04T13:44:25.4890449Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4890623Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4890747Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4890993Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4891072Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4891141Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4891200Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4891273Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4891307Z 2025-12-04T13:44:25.4891541Z [rank2]:[W1204 13:19:31.628707886 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4891706Z [rank3]:[W1204 13:19:32.956532879 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4891708Z 2025-12-04T13:44:25.4891876Z [rank3]:[W1204 13:19:32.956641676 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4892050Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4892091Z C++ CapturedTraceback: 2025-12-04T13:44:25.4892460Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4892605Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4892726Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4892978Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4893056Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4893130Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4893190Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4893272Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4893274Z 2025-12-04T13:44:25.4893511Z [rank3]:[W1204 13:19:32.957234613 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4893687Z [rank1]:[W1204 13:19:32.970934435 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4893863Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.4894128Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4894302Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4894696Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4894907Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4895017Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4895124Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4895246Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4895250Z 2025-12-04T13:44:25.4895492Z [rank1]:[W1204 13:19:32.973312112 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4895668Z [rank2]:[W1204 13:19:32.316677696 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4895845Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4896111Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.4896279Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4896654Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4896858Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4896970Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4897070Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4897176Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4897180Z 2025-12-04T13:44:25.4897425Z [rank2]:[W1204 13:19:32.318118184 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4897643Z [rank3]:[W1204 13:19:32.317872300 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4897826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4898083Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4898253Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4898650Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4898860Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4898969Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4899068Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4899174Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4899202Z 2025-12-04T13:44:25.4899441Z [rank3]:[W1204 13:19:32.320048741 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4899615Z [rank1]:[W1204 13:19:32.592890208 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4899617Z 2025-12-04T13:44:25.4899788Z [rank1]:[W1204 13:19:32.592958736 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4899970Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4900020Z C++ CapturedTraceback: 2025-12-04T13:44:25.4900389Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4900546Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4900667Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4900921Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4901001Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4901071Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4901129Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4901208Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4901210Z 2025-12-04T13:44:25.4901443Z [rank1]:[W1204 13:19:32.593024295 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4901608Z [rank2]:[W1204 13:19:32.628799242 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4901611Z 2025-12-04T13:44:25.4901783Z [rank2]:[W1204 13:19:32.628854571 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4901956Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4902000Z C++ CapturedTraceback: 2025-12-04T13:44:25.4902385Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4902533Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4902651Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4902903Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4902981Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4903047Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4903125Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4903200Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4903202Z 2025-12-04T13:44:25.4903437Z [rank2]:[W1204 13:19:32.628974448 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4903601Z [rank3]:[W1204 13:19:33.957374968 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4903603Z 2025-12-04T13:44:25.4903777Z [rank3]:[W1204 13:19:33.957492275 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4903953Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4904003Z C++ CapturedTraceback: 2025-12-04T13:44:25.4904377Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4904525Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4904651Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4904907Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4904994Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4905064Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4905128Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4905207Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4905209Z 2025-12-04T13:44:25.4905448Z [rank3]:[W1204 13:19:33.957590283 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4905616Z [rank1]:[W1204 13:19:33.973378879 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4905792Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.4906054Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4906242Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4906613Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4906815Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4906926Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4907045Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4907149Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4907151Z 2025-12-04T13:44:25.4907386Z [rank1]:[W1204 13:19:33.975855303 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4907619Z [rank2]:[W1204 13:19:33.318273499 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4907798Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4908054Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.4908223Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4908597Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4908806Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4908918Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4909017Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4909121Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4909123Z 2025-12-04T13:44:25.4909357Z [rank2]:[W1204 13:19:33.320558308 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4909530Z [rank3]:[W1204 13:19:33.320185586 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4909703Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4909964Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4910131Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4910530Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4910735Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4910841Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4910939Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4911064Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4911066Z 2025-12-04T13:44:25.4911302Z [rank3]:[W1204 13:19:33.322218541 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4911468Z [rank1]:[W1204 13:19:33.593070423 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4911473Z 2025-12-04T13:44:25.4911643Z [rank1]:[W1204 13:19:33.593130731 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4911819Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4911861Z C++ CapturedTraceback: 2025-12-04T13:44:25.4912235Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4912379Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4912501Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4912748Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4912829Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4912898Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4912957Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4913034Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4913038Z 2025-12-04T13:44:25.4913271Z [rank1]:[W1204 13:19:33.593525852 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4913439Z [rank2]:[W1204 13:19:33.629063495 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4913441Z 2025-12-04T13:44:25.4913608Z [rank2]:[W1204 13:19:33.629124283 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4913785Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4913828Z C++ CapturedTraceback: 2025-12-04T13:44:25.4914220Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4914366Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4914484Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4914739Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4914834Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4914903Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4914960Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4915041Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4915043Z 2025-12-04T13:44:25.4915277Z [rank2]:[W1204 13:19:33.629184132 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4915448Z [rank3]:[W1204 13:19:34.957728509 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4915450Z 2025-12-04T13:44:25.4915620Z [rank3]:[W1204 13:19:34.957827287 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4915804Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4915979Z C++ CapturedTraceback: 2025-12-04T13:44:25.4916387Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4916541Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4916684Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4916946Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4917055Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4917140Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4917227Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4917314Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4917330Z 2025-12-04T13:44:25.4917622Z [rank3]:[W1204 13:19:34.958175249 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4917825Z [rank1]:[W1204 13:19:34.976010379 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4918023Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.4918344Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4918522Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4918912Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4919152Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4919307Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4919432Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4919541Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4919543Z 2025-12-04T13:44:25.4919807Z [rank1]:[W1204 13:19:34.978418755 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4919987Z [rank2]:[W1204 13:19:34.320718964 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4920203Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4920485Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.4920659Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4921056Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4921262Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4921418Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4921526Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4921650Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4921653Z 2025-12-04T13:44:25.4921923Z [rank2]:[W1204 13:19:34.322594572 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4922101Z [rank3]:[W1204 13:19:34.322341068 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4922317Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4922584Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4922797Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4923174Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4923394Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4923541Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4923676Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4923799Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4923801Z 2025-12-04T13:44:25.4924048Z [rank3]:[W1204 13:19:34.324920110 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4924231Z [rank1]:[W1204 13:19:34.593628660 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4924233Z 2025-12-04T13:44:25.4924426Z [rank1]:[W1204 13:19:34.593690529 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4924635Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4924703Z C++ CapturedTraceback: 2025-12-04T13:44:25.4925087Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4925249Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4925391Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4925677Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4925765Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4925856Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4925925Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4926029Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4926031Z 2025-12-04T13:44:25.4926297Z [rank1]:[W1204 13:19:34.593734288 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4926472Z [rank2]:[W1204 13:19:34.629337439 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4926474Z 2025-12-04T13:44:25.4926667Z [rank2]:[W1204 13:19:34.629465856 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4926853Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4926950Z C++ CapturedTraceback: 2025-12-04T13:44:25.4927335Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4927521Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4927664Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4927923Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4928067Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4928151Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4928230Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4928315Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4928317Z 2025-12-04T13:44:25.4928576Z [rank2]:[W1204 13:19:34.629727960 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4928744Z [rank3]:[W1204 13:19:35.958309047 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4928748Z 2025-12-04T13:44:25.4928956Z [rank3]:[W1204 13:19:35.958408115 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4929155Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4929207Z C++ CapturedTraceback: 2025-12-04T13:44:25.4929605Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4929755Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4929914Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4930178Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4930281Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4930361Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4930435Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4930552Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4930553Z 2025-12-04T13:44:25.4930797Z [rank3]:[W1204 13:19:35.958491973 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4931005Z [rank1]:[W1204 13:19:35.978594172 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4931218Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.4931498Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4931685Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4932089Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4932335Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4932455Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4932572Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4932694Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4932696Z 2025-12-04T13:44:25.4932966Z [rank1]:[W1204 13:19:35.981080116 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4933147Z [rank2]:[W1204 13:19:35.322737660 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4933350Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4933630Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.4933817Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4934219Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4934435Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4934566Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4934677Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4934809Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4934811Z 2025-12-04T13:44:25.4935078Z [rank2]:[W1204 13:19:35.324623947 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4935258Z [rank3]:[W1204 13:19:35.325054188 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4935462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4935753Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4935950Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4936332Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4936563Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4936718Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4936826Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4936956Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4936958Z 2025-12-04T13:44:25.4937210Z [rank3]:[W1204 13:19:35.327290908 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4937402Z [rank1]:[W1204 13:19:35.593826967 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4937404Z 2025-12-04T13:44:25.4937632Z [rank1]:[W1204 13:19:35.593895485 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4937838Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4937917Z C++ CapturedTraceback: 2025-12-04T13:44:25.4938312Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4938481Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4938615Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4938889Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4938972Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4939082Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4939303Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4939410Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4939412Z 2025-12-04T13:44:25.4939669Z [rank1]:[W1204 13:19:35.594527571 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4939838Z [rank2]:[W1204 13:19:35.629848299 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4939842Z 2025-12-04T13:44:25.4940048Z [rank2]:[W1204 13:19:35.629939457 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4940267Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4940332Z C++ CapturedTraceback: 2025-12-04T13:44:25.4940713Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4940876Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4941068Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4941329Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4941429Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4941511Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4941589Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4941689Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4941691Z 2025-12-04T13:44:25.4941952Z [rank2]:[W1204 13:19:35.630048644 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4942132Z [rank3]:[W1204 13:19:36.958642621 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4942149Z 2025-12-04T13:44:25.4942329Z [rank3]:[W1204 13:19:36.958744969 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4942518Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4942870Z C++ CapturedTraceback: 2025-12-04T13:44:25.4943272Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4943428Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4943576Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4943837Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4943945Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4944042Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4944109Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4944205Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4944207Z 2025-12-04T13:44:25.4944452Z [rank3]:[W1204 13:19:36.958844947 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4944673Z [rank1]:[W1204 13:19:36.981194755 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4944863Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.4945142Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4945316Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4945716Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4945969Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4946092Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4946218Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4946326Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4946329Z 2025-12-04T13:44:25.4946589Z [rank1]:[W1204 13:19:36.983617441 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4946768Z [rank2]:[W1204 13:19:36.324735517 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4946992Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4947280Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.4947454Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4947904Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4948122Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4948267Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4948380Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4948503Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4948505Z 2025-12-04T13:44:25.4948760Z [rank2]:[W1204 13:19:36.326260723 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4948939Z [rank3]:[W1204 13:19:36.327376658 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4949187Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4949454Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.4949642Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4950035Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.4950269Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.4950417Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.4950527Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4951317Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.4951319Z 2025-12-04T13:44:25.4951561Z [rank3]:[W1204 13:19:36.329514380 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4951750Z [rank1]:[W1204 13:19:36.594626201 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4951756Z 2025-12-04T13:44:25.4951967Z [rank1]:[W1204 13:19:36.594698430 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4952153Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4952218Z C++ CapturedTraceback: 2025-12-04T13:44:25.4952600Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4952767Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4952909Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4953189Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4953276Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4953370Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4953450Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4953545Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4953547Z 2025-12-04T13:44:25.4953815Z [rank1]:[W1204 13:19:36.594772328 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4953994Z [rank2]:[W1204 13:19:36.630195273 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4953996Z 2025-12-04T13:44:25.4954222Z [rank2]:[W1204 13:19:36.630307881 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4954407Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4954481Z C++ CapturedTraceback: 2025-12-04T13:44:25.4954868Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4955068Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4955211Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4955472Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4955575Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4955661Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4955746Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4955836Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4955837Z 2025-12-04T13:44:25.4956093Z [rank2]:[W1204 13:19:36.630686892 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4956293Z [rank3]:[W1204 13:19:37.959345048 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.4956296Z 2025-12-04T13:44:25.4956481Z [rank3]:[W1204 13:19:37.959449866 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4956678Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.4956730Z C++ CapturedTraceback: 2025-12-04T13:44:25.4957119Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.4957269Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.4957435Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.4957829Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.4957926Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.4958014Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.4958077Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.4958197Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.4958200Z 2025-12-04T13:44:25.4958445Z [rank3]:[W1204 13:19:37.959549234 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.4958666Z [rank1]:[W1204 13:19:37.983755470 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.4958850Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5026109Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5026313Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5026771Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5026977Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5027092Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5027192Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5027292Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5027296Z 2025-12-04T13:44:25.5027652Z [rank1]:[W1204 13:19:37.986078668 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5027836Z [rank2]:[W1204 13:19:37.326417302 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5028019Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5028279Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5028447Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5028823Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5029030Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5029141Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5029238Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5029338Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5029340Z 2025-12-04T13:44:25.5029578Z [rank2]:[W1204 13:19:37.328189443 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5029797Z [rank3]:[W1204 13:19:37.329642080 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5029974Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5030236Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5030399Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5030768Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5031001Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5031107Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5031204Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5031299Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5031301Z 2025-12-04T13:44:25.5031537Z [rank3]:[W1204 13:19:37.331662855 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5031704Z [rank1]:[W1204 13:19:37.594882119 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5031708Z 2025-12-04T13:44:25.5031878Z [rank1]:[W1204 13:19:37.594969507 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5032054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5032097Z C++ CapturedTraceback: 2025-12-04T13:44:25.5032475Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5032622Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5032745Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5032994Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5033073Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5033142Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5033201Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5033279Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5033282Z 2025-12-04T13:44:25.5033515Z [rank1]:[W1204 13:19:37.595434417 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5033702Z [rank2]:[W1204 13:19:37.630791353 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5033704Z 2025-12-04T13:44:25.5033873Z [rank2]:[W1204 13:19:37.630849362 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5034047Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5034089Z C++ CapturedTraceback: 2025-12-04T13:44:25.5034457Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5034623Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5034744Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5034997Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5035072Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5035140Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5035199Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5035274Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5035278Z 2025-12-04T13:44:25.5035511Z [rank2]:[W1204 13:19:37.630909551 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5035676Z [rank3]:[W1204 13:19:38.959693144 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5035678Z 2025-12-04T13:44:25.5035845Z [rank3]:[W1204 13:19:38.959798592 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5036023Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5036065Z C++ CapturedTraceback: 2025-12-04T13:44:25.5036429Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5036576Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5036693Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5036941Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5037015Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5037084Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5037141Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5037218Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5037220Z 2025-12-04T13:44:25.5037469Z [rank3]:[W1204 13:19:38.959893400 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5037683Z [rank1]:[W1204 13:19:38.986258088 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5037857Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5038116Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5038307Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5038673Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5038875Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5038984Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5039080Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5039181Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5039183Z 2025-12-04T13:44:25.5039415Z [rank1]:[W1204 13:19:38.987771554 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5039584Z [rank2]:[W1204 13:19:38.328306914 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5039760Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5040019Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5040181Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5040551Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5040753Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5040859Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5040957Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5041054Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5041058Z 2025-12-04T13:44:25.5041314Z [rank2]:[W1204 13:19:38.329845250 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5041484Z [rank3]:[W1204 13:19:38.331765037 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5041659Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5041916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5042080Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5042473Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5042673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5042779Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5042875Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5042975Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5042977Z 2025-12-04T13:44:25.5043211Z [rank3]:[W1204 13:19:38.334357448 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5043378Z [rank1]:[W1204 13:19:38.595557228 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5043380Z 2025-12-04T13:44:25.5043550Z [rank1]:[W1204 13:19:38.595641016 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5043723Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5043767Z C++ CapturedTraceback: 2025-12-04T13:44:25.5044137Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5044286Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5044403Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5044651Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5044728Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5044794Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5044850Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5044925Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5044927Z 2025-12-04T13:44:25.5045178Z [rank1]:[W1204 13:19:38.595704375 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5045343Z [rank2]:[W1204 13:19:38.631045352 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5045345Z 2025-12-04T13:44:25.5045513Z [rank2]:[W1204 13:19:38.631135670 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5045685Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5045727Z C++ CapturedTraceback: 2025-12-04T13:44:25.5046099Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5046263Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5046382Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5046629Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5046705Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5046771Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5046827Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5046902Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5046908Z 2025-12-04T13:44:25.5047141Z [rank2]:[W1204 13:19:38.631772936 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5047303Z [rank3]:[W1204 13:19:39.960319655 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5047306Z 2025-12-04T13:44:25.5047533Z [rank3]:[W1204 13:19:39.960435782 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5047706Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5047748Z C++ CapturedTraceback: 2025-12-04T13:44:25.5048112Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5048257Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5048374Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5048626Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5048702Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5048767Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5048822Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5048923Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5048925Z 2025-12-04T13:44:25.5049160Z [rank3]:[W1204 13:19:39.960535560 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5049328Z [rank1]:[W1204 13:19:39.987866297 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5049499Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5049756Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5049949Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5050319Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5050521Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5050628Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5050728Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5050824Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5050828Z 2025-12-04T13:44:25.5051061Z [rank1]:[W1204 13:19:39.990275973 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5051230Z [rank2]:[W1204 13:19:39.330014941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5051405Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5051659Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5051828Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5052196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5052398Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5052505Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5052601Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5052700Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5052702Z 2025-12-04T13:44:25.5052956Z [rank2]:[W1204 13:19:39.331722773 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5053128Z [rank3]:[W1204 13:19:39.334485051 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5053303Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5053558Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5053742Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5054108Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5054311Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5054414Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5054510Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5054610Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5054613Z 2025-12-04T13:44:25.5054845Z [rank3]:[W1204 13:19:39.336590063 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5055010Z [rank1]:[W1204 13:19:39.595799688 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5055013Z 2025-12-04T13:44:25.5055181Z [rank1]:[W1204 13:19:39.595859007 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5055356Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5055396Z C++ CapturedTraceback: 2025-12-04T13:44:25.5055768Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5055911Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5056028Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5056276Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5056351Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5056419Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5056474Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5056548Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5056570Z 2025-12-04T13:44:25.5056803Z [rank1]:[W1204 13:19:39.596319497 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5056965Z [rank2]:[W1204 13:19:39.631855639 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5056967Z 2025-12-04T13:44:25.5057135Z [rank2]:[W1204 13:19:39.631909128 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5057308Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5057374Z C++ CapturedTraceback: 2025-12-04T13:44:25.5057788Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5057932Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5058048Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5058297Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5058374Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5058442Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5058497Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5058574Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5058576Z 2025-12-04T13:44:25.5058807Z [rank2]:[W1204 13:19:39.631966057 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5058969Z [rank3]:[W1204 13:19:40.960681933 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5058972Z 2025-12-04T13:44:25.5059141Z [rank3]:[W1204 13:19:40.960785221 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5059316Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5059358Z C++ CapturedTraceback: 2025-12-04T13:44:25.5059723Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5059867Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5059982Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5060230Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5060306Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5060400Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5060456Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5060529Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5060531Z 2025-12-04T13:44:25.5060763Z [rank3]:[W1204 13:19:40.960882829 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5060930Z [rank1]:[W1204 13:19:40.990384136 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5061103Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5061387Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5061551Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5061916Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5062115Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5062225Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5062323Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5062421Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5062423Z 2025-12-04T13:44:25.5062654Z [rank1]:[W1204 13:19:40.992876690 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5062824Z [rank2]:[W1204 13:19:40.331870886 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5062998Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5063254Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5063419Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5063785Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5063986Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5064096Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5064193Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5064309Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5064313Z 2025-12-04T13:44:25.5064543Z [rank2]:[W1204 13:19:40.333259545 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5064713Z [rank3]:[W1204 13:19:40.336677089 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5064885Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5065140Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5065323Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5065691Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5065891Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5065995Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5066093Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5066190Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5066192Z 2025-12-04T13:44:25.5066424Z [rank3]:[W1204 13:19:40.339211302 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5066587Z [rank1]:[W1204 13:19:40.596431932 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5066589Z 2025-12-04T13:44:25.5066758Z [rank1]:[W1204 13:19:40.596512710 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5066930Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5066973Z C++ CapturedTraceback: 2025-12-04T13:44:25.5067343Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5067534Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5067653Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5067903Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5067983Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5068049Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5068134Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5068208Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5068210Z 2025-12-04T13:44:25.5068442Z [rank1]:[W1204 13:19:40.596591658 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5068606Z [rank2]:[W1204 13:19:40.632045323 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5068610Z 2025-12-04T13:44:25.5068776Z [rank2]:[W1204 13:19:40.632096072 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5068976Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5069016Z C++ CapturedTraceback: 2025-12-04T13:44:25.5069388Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5069531Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5069651Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5069896Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5069973Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5070041Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5070096Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5070170Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5070171Z 2025-12-04T13:44:25.5070401Z [rank2]:[W1204 13:19:40.632514202 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5070565Z [rank3]:[W1204 13:19:41.961035832 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5070568Z 2025-12-04T13:44:25.5070736Z [rank3]:[W1204 13:19:41.961143860 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5070911Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5070951Z C++ CapturedTraceback: 2025-12-04T13:44:25.5071321Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5071466Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5071582Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5071856Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5071930Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5071997Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5072052Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5072128Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5072130Z 2025-12-04T13:44:25.5072361Z [rank3]:[W1204 13:19:41.961233808 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5072530Z [rank1]:[W1204 13:19:41.992985316 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5072721Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5072980Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5073146Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5073514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5073721Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5073828Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5073926Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5074022Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5074024Z 2025-12-04T13:44:25.5074257Z [rank1]:[W1204 13:19:41.995414941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5074428Z [rank2]:[W1204 13:19:41.333431839 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5074602Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5074859Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5075022Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5075388Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5075589Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5075717Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5075815Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5075913Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5075915Z 2025-12-04T13:44:25.5076150Z [rank2]:[W1204 13:19:41.335549981 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5076318Z [rank3]:[W1204 13:19:41.339330367 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5076513Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5076770Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5076933Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5077298Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5077529Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5077639Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5077736Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5077834Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5077836Z 2025-12-04T13:44:25.5078069Z [rank3]:[W1204 13:19:41.341152466 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5078236Z [rank1]:[W1204 13:19:41.596728363 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5078238Z 2025-12-04T13:44:25.5078408Z [rank1]:[W1204 13:19:41.596839920 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5078586Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5078629Z C++ CapturedTraceback: 2025-12-04T13:44:25.5078998Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5079143Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5079260Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5079507Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5079613Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5079679Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5079733Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5079807Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5079809Z 2025-12-04T13:44:25.5080042Z [rank1]:[W1204 13:19:41.597411127 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5080202Z [rank2]:[W1204 13:19:41.632646437 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5080231Z 2025-12-04T13:44:25.5080400Z [rank2]:[W1204 13:19:41.632761205 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5080574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5080615Z C++ CapturedTraceback: 2025-12-04T13:44:25.5080980Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5081124Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5081242Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5081495Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5081570Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5081635Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5081693Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5081767Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5081769Z 2025-12-04T13:44:25.5082003Z [rank2]:[W1204 13:19:41.632867592 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5082164Z [rank3]:[W1204 13:19:42.961382593 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5082168Z 2025-12-04T13:44:25.5082340Z [rank3]:[W1204 13:19:42.961497330 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5082512Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5082556Z C++ CapturedTraceback: 2025-12-04T13:44:25.5082921Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5083063Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5083184Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5083450Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5083526Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5083592Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5083649Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5083722Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5083725Z 2025-12-04T13:44:25.5083959Z [rank3]:[W1204 13:19:42.961601548 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5084146Z [rank1]:[W1204 13:19:42.995577446 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5084319Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5084579Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5084742Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5085109Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5085317Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5085425Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5085524Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5085622Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5085624Z 2025-12-04T13:44:25.5085859Z [rank1]:[W1204 13:19:42.998371283 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5086030Z [rank2]:[W1204 13:19:42.335698697 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5086206Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5086462Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5086628Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5086999Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5087225Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5087334Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5087431Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5087550Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5087553Z 2025-12-04T13:44:25.5087787Z [rank2]:[W1204 13:19:42.337273721 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5087957Z [rank3]:[W1204 13:19:42.341271802 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5088160Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5088417Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5088580Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5088946Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5089152Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5089258Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5089356Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5089451Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5089453Z 2025-12-04T13:44:25.5089686Z [rank3]:[W1204 13:19:42.343733047 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5089851Z [rank1]:[W1204 13:19:42.597506444 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5089855Z 2025-12-04T13:44:25.5090023Z [rank1]:[W1204 13:19:42.597580233 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5090198Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5090240Z C++ CapturedTraceback: 2025-12-04T13:44:25.5090607Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5090750Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5090870Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5091144Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5091222Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5091289Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5091343Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5091419Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5091421Z 2025-12-04T13:44:25.5091653Z [rank1]:[W1204 13:19:42.597639401 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5091838Z [rank2]:[W1204 13:19:42.632969059 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5091840Z 2025-12-04T13:44:25.5092009Z [rank2]:[W1204 13:19:42.633044687 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5092183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5092223Z C++ CapturedTraceback: 2025-12-04T13:44:25.5092592Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5092737Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5092855Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5093104Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5093178Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5093244Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5093299Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5093374Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5093376Z 2025-12-04T13:44:25.5093607Z [rank2]:[W1204 13:19:42.633113055 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5093819Z [rank1]:W1204 13:19:42.847000 60748 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.5093982Z [rank3]:[W1204 13:19:43.961713995 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5093984Z 2025-12-04T13:44:25.5094150Z [rank3]:[W1204 13:19:43.961803783 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5094324Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5094363Z C++ CapturedTraceback: 2025-12-04T13:44:25.5094750Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5094894Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5095011Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5095260Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5095336Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5095404Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5095479Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5095554Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5095556Z 2025-12-04T13:44:25.5095788Z [rank3]:[W1204 13:19:43.961877611 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5095957Z [rank1]:[W1204 13:19:43.998477100 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5096129Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5096386Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5096551Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5096924Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5097125Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5097233Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5097331Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5097429Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5097431Z 2025-12-04T13:44:25.5097717Z [rank1]:[W1204 13:19:43.000909766 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5097870Z [rank1]:[W1204 13:19:43.028527206 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T13:44:25.5098034Z [rank1]:[W1204 13:19:43.028558046 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5098036Z 2025-12-04T13:44:25.5098206Z [rank2]:[W1204 13:19:43.337426008 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5098379Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5098660Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5098824Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5099189Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5099390Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5099530Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5099629Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5099726Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5099728Z 2025-12-04T13:44:25.5099964Z [rank2]:[W1204 13:19:43.339849323 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5100133Z [rank3]:[W1204 13:19:43.343872473 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5100307Z Exception 
raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5100563Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5100728Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5101093Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5101297Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5101406Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5101502Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5101601Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5101603Z 2025-12-04T13:44:25.5101834Z [rank3]:[W1204 13:19:43.345497627 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5101998Z [rank2]:[W1204 13:19:43.633244843 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5102000Z 2025-12-04T13:44:25.5102167Z [rank2]:[W1204 13:19:43.633327211 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5102342Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5102385Z C++ CapturedTraceback: 2025-12-04T13:44:25.5102771Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5102916Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5103033Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5103288Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5103382Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5103452Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5103507Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5103582Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5103584Z 2025-12-04T13:44:25.5103816Z [rank2]:[W1204 13:19:43.633566845 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5103977Z [rank3]:[W1204 13:19:44.961984079 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5103979Z 2025-12-04T13:44:25.5104148Z [rank3]:[W1204 13:19:44.962053338 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5104321Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5104363Z C++ CapturedTraceback: 2025-12-04T13:44:25.5104728Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5104872Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5104991Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5105245Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5105322Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5105390Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5105448Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5105522Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5105524Z 2025-12-04T13:44:25.5105756Z [rank3]:[W1204 13:19:44.962119816 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5105925Z [rank1]:[W1204 13:19:44.001078132 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5106102Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5106380Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5106546Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5106915Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5107137Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5107247Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5107344Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5107441Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5107443Z 2025-12-04T13:44:25.5107719Z [rank1]:[W1204 13:19:44.003513128 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5107889Z [rank2]:[W1204 13:19:44.340022260 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5108066Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5108321Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5108486Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5108852Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5109061Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5109168Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5109268Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5109364Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5109368Z 2025-12-04T13:44:25.5109601Z [rank2]:[W1204 13:19:44.342123763 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5109774Z [rank3]:[W1204 13:19:44.345625405 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5109948Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5110232Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5110394Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5110759Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5110961Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5111095Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5111192Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5111289Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5111291Z 2025-12-04T13:44:25.5111523Z [rank3]:[W1204 13:19:44.347844655 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5111685Z [rank2]:[W1204 13:19:44.633654135 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5111687Z 2025-12-04T13:44:25.5111859Z [rank2]:[W1204 13:19:44.633701624 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5112033Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5112078Z C++ CapturedTraceback: 2025-12-04T13:44:25.5112447Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5112590Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5112708Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5112957Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5113036Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5113104Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5113162Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5113236Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5113238Z 2025-12-04T13:44:25.5113471Z [rank2]:[W1204 13:19:44.633960148 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5113633Z [rank3]:[W1204 13:19:45.962264255 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5113635Z 2025-12-04T13:44:25.5113803Z [rank3]:[W1204 13:19:45.962370262 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5114012Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5114054Z C++ CapturedTraceback: 2025-12-04T13:44:25.5114423Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5114567Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5114687Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5114969Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5115046Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5115113Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5115168Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5115243Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5115245Z 2025-12-04T13:44:25.5115476Z [rank3]:[W1204 13:19:45.962452450 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5115645Z [rank1]:[W1204 13:19:45.003620837 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5115819Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5116078Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5116241Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5116608Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5116811Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5116918Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5117017Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5117112Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5117114Z 2025-12-04T13:44:25.5117347Z [rank1]:[W1204 13:19:45.006063092 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5117565Z [rank2]:[W1204 13:19:45.342278102 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5117742Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5118027Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5118190Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5118560Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5118761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5118891Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5118990Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5119089Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5119091Z 2025-12-04T13:44:25.5119321Z [rank2]:[W1204 13:19:45.344242168 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5119493Z [rank3]:[W1204 13:19:45.347964344 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5119668Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5119924Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5120352Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5120715Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5120916Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5121023Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5121123Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5121221Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5121223Z 2025-12-04T13:44:25.5121455Z [rank3]:[W1204 13:19:45.350223764 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5121619Z [rank2]:[W1204 13:19:45.634070508 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5121621Z 2025-12-04T13:44:25.5121790Z [rank2]:[W1204 13:19:45.634157836 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5121969Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5122033Z C++ CapturedTraceback: 2025-12-04T13:44:25.5122403Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5122547Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5122664Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5122914Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5123012Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5123080Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5123135Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5123211Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5123212Z 2025-12-04T13:44:25.5123443Z [rank2]:[W1204 13:19:45.634307052 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5123607Z [rank3]:[W1204 13:19:46.962606869 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5123610Z 2025-12-04T13:44:25.5123781Z [rank3]:[W1204 13:19:46.962731387 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5123954Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5123996Z C++ CapturedTraceback: 2025-12-04T13:44:25.5124360Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5124503Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5124620Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5124873Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5124946Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5125013Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5125067Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5125142Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5125145Z 2025-12-04T13:44:25.5125377Z [rank3]:[W1204 13:19:46.962846494 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5125546Z [rank1]:[W1204 13:19:46.006265850 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5125742Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5126001Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5126167Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5126531Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5126754Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5126863Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5126959Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5127058Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5127060Z 2025-12-04T13:44:25.5127290Z [rank1]:[W1204 13:19:46.008744325 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5127460Z [rank2]:[W1204 13:19:46.344424547 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5127674Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5127932Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5128095Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5128465Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5128669Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5128774Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5128872Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5128968Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5128970Z 2025-12-04T13:44:25.5129202Z [rank2]:[W1204 13:19:46.346476171 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5129370Z [rank3]:[W1204 13:19:46.350370053 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5129546Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5129842Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5130003Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5130368Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5130567Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5130699Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5130797Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5130897Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5130900Z 2025-12-04T13:44:25.5131131Z [rank3]:[W1204 13:19:46.352450247 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5131296Z [rank2]:[W1204 13:19:46.634436183 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5131298Z 2025-12-04T13:44:25.5131466Z [rank2]:[W1204 13:19:46.634704047 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5131641Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5131683Z C++ CapturedTraceback: 2025-12-04T13:44:25.5132050Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5132196Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5132315Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5132563Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5132641Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5132707Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5132764Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5132837Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5132841Z 2025-12-04T13:44:25.5133074Z [rank2]:[W1204 13:19:46.634793135 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5133237Z [rank3]:[W1204 13:19:47.963011314 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5133241Z 2025-12-04T13:44:25.5133410Z [rank3]:[W1204 13:19:47.963143281 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5134888Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5134932Z C++ CapturedTraceback: 2025-12-04T13:44:25.5135298Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5135440Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5135580Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5135828Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5135903Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5135969Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5136024Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5136097Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5136099Z 2025-12-04T13:44:25.5136333Z [rank3]:[W1204 13:19:47.963255909 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5136506Z [rank1]:[W1204 13:19:47.008893555 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5136680Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5136937Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5137099Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5137465Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5137716Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5137824Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5137922Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5138019Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5138021Z 2025-12-04T13:44:25.5138252Z [rank1]:[W1204 13:19:47.011421549 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5138420Z [rank2]:[W1204 13:19:47.346612682 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5138599Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5138882Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5139049Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5139421Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5139654Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5139761Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5139858Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5139956Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5139957Z 2025-12-04T13:44:25.5140188Z [rank2]:[W1204 13:19:47.348570618 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5140360Z [rank3]:[W1204 13:19:47.352585868 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5140535Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5140792Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5140956Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5141320Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5141522Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5141631Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5141729Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5141826Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5141832Z 2025-12-04T13:44:25.5142062Z [rank3]:[W1204 13:19:47.354652431 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5142226Z [rank2]:[W1204 13:19:47.634932066 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5142228Z 2025-12-04T13:44:25.5142398Z [rank2]:[W1204 13:19:47.635061123 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5142592Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5142633Z C++ CapturedTraceback: 2025-12-04T13:44:25.5142999Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5143142Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5143261Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5143536Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5143611Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5143678Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5143732Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5143807Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5143809Z 2025-12-04T13:44:25.5144040Z [rank2]:[W1204 13:19:47.635167861 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5144204Z [rank3]:[W1204 13:19:48.963404300 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5144208Z 2025-12-04T13:44:25.5144376Z [rank3]:[W1204 13:19:48.963537517 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5144549Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5144590Z C++ CapturedTraceback: 2025-12-04T13:44:25.5144953Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5145097Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5145215Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5145467Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5145540Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5145607Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5145661Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5145736Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5145738Z 2025-12-04T13:44:25.5145969Z [rank3]:[W1204 13:19:48.963655435 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5146140Z [rank1]:[W1204 13:19:48.011529271 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5146342Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5146598Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5146761Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5147125Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5147348Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5147454Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5147610Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5147709Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5147711Z 2025-12-04T13:44:25.5147942Z [rank1]:[W1204 13:19:48.014092663 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5148113Z [rank2]:[W1204 13:19:48.348738189 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5148288Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5148543Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5148706Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5149072Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5149276Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5149382Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5149481Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5149577Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5149579Z 2025-12-04T13:44:25.5149813Z [rank2]:[W1204 13:19:48.350634397 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5149981Z [rank3]:[W1204 13:19:48.354940440 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5150158Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5150440Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5150602Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5150967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5151194Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5151302Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5151397Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5151496Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5151498Z 2025-12-04T13:44:25.5151733Z [rank3]:[W1204 13:19:48.356476556 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5151896Z [rank2]:[W1204 13:19:48.635301413 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5151900Z 2025-12-04T13:44:25.5152069Z [rank2]:[W1204 13:19:48.635397391 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5152243Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5152285Z C++ CapturedTraceback: 2025-12-04T13:44:25.5152651Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5152799Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5152916Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5153166Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5153244Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5153309Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5153365Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5153439Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5153441Z 2025-12-04T13:44:25.5153673Z [rank2]:[W1204 13:19:48.635498749 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5153835Z [rank3]:[W1204 13:19:49.963806427 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5153839Z 2025-12-04T13:44:25.5154028Z [rank3]:[W1204 13:19:49.963933824 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5154201Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5154244Z C++ CapturedTraceback: 2025-12-04T13:44:25.5154613Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5154757Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5154901Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5155152Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5155231Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5155298Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5155359Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5155434Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5155436Z 2025-12-04T13:44:25.5155672Z [rank3]:[W1204 13:19:49.964057971 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5155843Z [rank1]:[W1204 13:19:49.014268215 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5156024Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5156285Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5156451Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5156826Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5157033Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5157146Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5157244Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5157345Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5157347Z 2025-12-04T13:44:25.5157620Z [rank1]:[W1204 13:19:49.016882547 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5157792Z [rank2]:[W1204 13:19:49.350753650 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5157998Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5158256Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5158425Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5158793Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5159032Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5159143Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5159242Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5159344Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5159346Z 2025-12-04T13:44:25.5159580Z [rank2]:[W1204 13:19:49.352619629 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5159754Z [rank3]:[W1204 13:19:49.356567440 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5159931Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5160193Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5160361Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5160726Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5160933Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5161041Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5161142Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5161239Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5161241Z 2025-12-04T13:44:25.5161479Z [rank3]:[W1204 13:19:49.358834089 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5161647Z [rank2]:[W1204 13:19:49.635602093 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5161651Z 2025-12-04T13:44:25.5161822Z [rank2]:[W1204 13:19:49.635670752 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5162018Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5162061Z C++ CapturedTraceback: 2025-12-04T13:44:25.5162430Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5162574Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5162714Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5162966Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5163045Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5163111Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5163169Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5163246Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5163248Z 2025-12-04T13:44:25.5163481Z [rank2]:[W1204 13:19:49.635732760 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5163653Z [rank3]:[W1204 13:19:50.964192445 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5163655Z 2025-12-04T13:44:25.5163824Z [rank3]:[W1204 13:19:50.964281043 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5164000Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5164042Z C++ CapturedTraceback: 2025-12-04T13:44:25.5164414Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5164562Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5164681Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5164931Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5165008Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5165078Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5165134Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5165211Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5165212Z 2025-12-04T13:44:25.5165444Z [rank3]:[W1204 13:19:50.964364801 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5165635Z [rank1]:[W1204 13:19:50.017059320 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5165809Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5166068Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5166234Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5166603Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5166829Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5166935Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5167039Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5167136Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5167138Z 2025-12-04T13:44:25.5167374Z [rank1]:[W1204 13:19:50.019552464 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5167582Z [rank2]:[W1204 13:19:50.352749403 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5167761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5168017Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5168182Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5168552Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5168758Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5168869Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5168970Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5169068Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5169069Z 2025-12-04T13:44:25.5169304Z [rank2]:[W1204 13:19:50.354029934 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5169477Z [rank3]:[W1204 13:19:50.358917235 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5169682Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5169939Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5170104Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5170476Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5170703Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5170814Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5170911Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5171012Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5171014Z 2025-12-04T13:44:25.5171246Z [rank3]:[W1204 13:19:50.361195674 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5171420Z [rank1]:[W1204 13:19:50.367168900 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5171596Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5171643Z C++ CapturedTraceback: 2025-12-04T13:44:25.5172017Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5172162Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5172286Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5172485Z #7 c10d::TCPStore::doWait(c10::ArrayRef, std::allocator > >, std::chrono::duration >) from ??:0 2025-12-04T13:44:25.5172626Z #8 c10d::TCPStore::doGet(std::__cxx11::basic_string, 
std::allocator > const&) from ??:0 2025-12-04T13:44:25.5172756Z #9 c10d::TCPStore::get(std::__cxx11::basic_string, std::allocator > const&) from ??:0 2025-12-04T13:44:25.5172934Z #10 c10d::PrefixStore::get(std::__cxx11::basic_string, std::allocator > const&) [clone .localalias] from PrefixStore.cpp:0 2025-12-04T13:44:25.5173145Z #11 c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string, std::allocator > const&, int) from ??:0 2025-12-04T13:44:25.5173346Z #12 c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string, std::allocator > const&, c10::Device&, c10d::OpType, int, bool) from ??:0 2025-12-04T13:44:25.5173670Z #13 c10d::ProcessGroupNCCL::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from ??:0 2025-12-04T13:44:25.5174088Z #14 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from Ops.cpp:0 2025-12-04T13:44:25.5175620Z #15 c10::impl::make_boxed_from_unboxed_functor >, std::allocator > > >, c10::intrusive_ptr > > (*)(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), std::tuple >, std::allocator > > >, c10::intrusive_ptr > >, c10::guts::typelist::typelist >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from :0 2025-12-04T13:44:25.5176006Z #16 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from autograd_not_implemented_fallback.cpp:0 2025-12-04T13:44:25.5177085Z #17 c10::impl::BoxedKernelWrapper >, std::allocator > > >, c10::intrusive_ptr > > (std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), 
void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from :0 2025-12-04T13:44:25.5177374Z #18 c10d::ProcessGroup::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from :0 2025-12-04T13:44:25.5181885Z #19 pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr >, c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) 
[148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr > (*)(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0 2025-12-04T13:44:25.5182012Z #20 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.5182117Z #21 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.5182221Z #22 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5182347Z #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5182476Z #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5182593Z #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5182712Z #26 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5182798Z #27 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5182917Z #28 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5183040Z #29 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5183157Z #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5183251Z #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 
2025-12-04T13:44:25.5183356Z #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5183469Z #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5183595Z #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5183712Z #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5183834Z #36 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5183952Z #37 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5184072Z #38 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5184189Z #39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5184309Z #40 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5184428Z #41 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5184551Z #42 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5184667Z #43 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5184791Z #44 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5184908Z #45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5185030Z #46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5185146Z #47 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5185268Z #48 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5185384Z #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5185478Z #50 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5185566Z #51 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5185682Z #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5185803Z #53 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5185919Z #54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5186040Z #55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5186156Z #56 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5186279Z #57 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5186419Z #58 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5186541Z #59 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5186658Z #60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5186779Z #61 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5186896Z #62 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5186980Z 
#63 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5187098Z #64 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5187239Z #65 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5187356Z #66 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5187517Z #67 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5187629Z #68 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5187741Z #69 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5187840Z #70 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5187939Z #71 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5188036Z #72 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5188165Z #73 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5188279Z #74 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5188393Z #75 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5188490Z #76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5188589Z #77 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5188682Z #78 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5188805Z #79 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5188915Z #80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5189013Z #81 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5189104Z #82 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5189193Z #83 _PyObject_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:305 2025-12-04T13:44:25.5189242Z #84 dynamo__custom_eval_frame from :0 2025-12-04T13:44:25.5189357Z #85 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5189441Z #86 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5189554Z #87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5189637Z #88 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5189748Z #89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5189832Z #90 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5189944Z #91 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5190067Z #92 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5190179Z #93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5190277Z #94 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5190369Z #95 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5190466Z #96 _PyObject_MakeTpCall from 
/usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5190587Z #97 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5190702Z #98 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5190809Z #99 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5190929Z #100 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5191016Z #101 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5191134Z #102 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5191218Z #103 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5191338Z #104 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5191466Z #105 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5191585Z #106 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5191711Z #107 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5191825Z #108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5191952Z #109 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5192065Z #110 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5192163Z #111 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5192247Z #112 do_call_core from 
/usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5192364Z #113 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5192488Z #114 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5192604Z #115 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5192728Z #116 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5192847Z #117 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5192972Z #118 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5193088Z #119 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5193210Z #120 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5193324Z #121 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5193419Z #122 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 2025-12-04T13:44:25.5193523Z #123 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.5193613Z #124 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.5193714Z #125 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.5193849Z #126 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.5193943Z #127 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.5194031Z #128 Py_BytesMain from 
/usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.5194119Z #129 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.5194191Z #130 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.5194232Z #131 _start from ??:0 2025-12-04T13:44:25.5194284Z #132 from ??:0 2025-12-04T13:44:25.5194286Z 2025-12-04T13:44:25.5194453Z [rank1]:[W1204 13:19:50.367474533 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.5194475Z 2025-12-04T13:44:25.5194640Z [rank1]:[W1204 13:19:50.367503712 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.5194643Z 2025-12-04T13:44:25.5194817Z [rank1]:[W1204 13:19:50.368978579 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43782, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5194998Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5195042Z C++ CapturedTraceback: 2025-12-04T13:44:25.5195418Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5195568Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5195687Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5195941Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5196019Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5196090Z #9 std::error_code::default_error_condition() const from ??:0 
2025-12-04T13:44:25.5196147Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5196228Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5196232Z 2025-12-04T13:44:25.5196467Z [rank1]:[W1204 13:19:50.369085157 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5196607Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] Caught exception: 2025-12-04T13:44:25.5196763Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] Traceback (most recent call last): 2025-12-04T13:44:25.5197047Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.5197197Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] getattr(self, test_name)() 2025-12-04T13:44:25.5197513Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.5197655Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] fn() 2025-12-04T13:44:25.5197921Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.5198063Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] method(*args, **kwargs) 2025-12-04T13:44:25.5198269Z E1204 13:19:50.440000 60748 
site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.5198412Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwds) 2025-12-04T13:44:25.5198718Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.5198865Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.5199126Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 808, in test_fsdp_inductor 2025-12-04T13:44:25.5199266Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] outputs = fsdp_m(inputs) 2025-12-04T13:44:25.5199517Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441, in __call__ 2025-12-04T13:44:25.5199681Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return super().__call__(*args, **kwargs) 2025-12-04T13:44:25.5199951Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl 2025-12-04T13:44:25.5200111Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return self._call_impl(*args, **kwargs) 2025-12-04T13:44:25.5200365Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl 2025-12-04T13:44:25.5200526Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return forward_call(*args, **kwargs) 2025-12-04T13:44:25.5200786Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926, in compile_wrapper 2025-12-04T13:44:25.5200931Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return fn(*args, **kwargs) 2025-12-04T13:44:25.5201182Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 2194, in __call__ 2025-12-04T13:44:25.5201344Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] result = self._torchdynamo_orig_backend( 2025-12-04T13:44:25.5201595Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1937, in __call__ 2025-12-04T13:44:25.5201744Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] result = self._inner_convert( 2025-12-04T13:44:25.5202017Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 706, in __call__ 2025-12-04T13:44:25.5202151Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] result = _compile( 2025-12-04T13:44:25.5202403Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1807, in _compile 2025-12-04T13:44:25.5202557Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] raise InternalTorchDynamoError( 2025-12-04T13:44:25.5202811Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1744, in _compile 2025-12-04T13:44:25.5203030Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] guarded_code, tracer_output = compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.5203282Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_utils_internal.py", line 97, in wrapper_function 2025-12-04T13:44:25.5203433Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return function(*args, **kwargs) 2025-12-04T13:44:25.5203690Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1425, in compile_inner 2025-12-04T13:44:25.5203859Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return _compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.5204127Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1459, in _compile_inner 2025-12-04T13:44:25.5204278Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] dynamo_output = compile_frame( 2025-12-04T13:44:25.5204536Z E1204 13:19:50.440000 60748 
site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1341, in compile_frame 2025-12-04T13:44:25.5204728Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] bytecode, tracer_output = transform_code_object(code, transform) 2025-12-04T13:44:25.5205019Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1600, in transform_code_object 2025-12-04T13:44:25.5205206Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] tracer_output = transformations(instructions, code_options) 2025-12-04T13:44:25.5205462Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1313, in transform 2025-12-04T13:44:25.5205606Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] tracer_output = trace_frame( 2025-12-04T13:44:25.5205852Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 328, in _fn 2025-12-04T13:44:25.5205995Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return fn(*args, **kwargs) 2025-12-04T13:44:25.5206277Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 837, in trace_frame 2025-12-04T13:44:25.5206405Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] run_tracer() 2025-12-04T13:44:25.5206660Z E1204 
13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 818, in run_tracer 2025-12-04T13:44:25.5206786Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] tracer.run() 2025-12-04T13:44:25.5207036Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1639, in run 2025-12-04T13:44:25.5207191Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] while self.step(): 2025-12-04T13:44:25.5207441Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1319, in step 2025-12-04T13:44:25.5207659Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] self.dispatch_table[inst.opcode](self, inst) 2025-12-04T13:44:25.5207913Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 856, in wrapper 2025-12-04T13:44:25.5208095Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return handle_graph_break(self, inst, speculation.reason) 2025-12-04T13:44:25.5208371Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 918, in handle_graph_break 2025-12-04T13:44:25.5208552Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] all_stack_locals_metadata = self.output.compile_subgraph( 2025-12-04T13:44:25.5208821Z E1204 
13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1712, in compile_subgraph 2025-12-04T13:44:25.5208972Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] self.run_compiler_collective() 2025-12-04T13:44:25.5209246Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 2069, in run_compiler_collective 2025-12-04T13:44:25.5209447Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] dist.all_gather_object(all_states, ds.local_state, group=compile_pg) 2025-12-04T13:44:25.5209700Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.5209847Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, **kwargs) 2025-12-04T13:44:25.5210128Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3248, in all_gather_object 2025-12-04T13:44:25.5210305Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] all_gather(object_size_list, local_size, group=group) 2025-12-04T13:44:25.5210581Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.5210729Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return func(*args, 
**kwargs) 2025-12-04T13:44:25.5211002Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4015, in all_gather 2025-12-04T13:44:25.5211179Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] work = group.allgather([tensor_list], [tensor], opts) 2025-12-04T13:44:25.5211567Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] torch._dynamo.exc.InternalTorchDynamoError: DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe 2025-12-04T13:44:25.5211881Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5212019Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] C++ CapturedTraceback: 2025-12-04T13:44:25.5212505Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5212771Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5213005Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5213320Z E1204 13:19:50.440000 60748 
site-packages/torch/testing/_internal/common_distributed.py:935] #7 c10d::TCPStore::doWait(c10::ArrayRef, std::allocator > >, std::chrono::duration >) from ??:0 2025-12-04T13:44:25.5213572Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #8 c10d::TCPStore::doGet(std::__cxx11::basic_string, std::allocator > const&) from ??:0 2025-12-04T13:44:25.5213819Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #9 c10d::TCPStore::get(std::__cxx11::basic_string, std::allocator > const&) from ??:0 2025-12-04T13:44:25.5214108Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #10 c10d::PrefixStore::get(std::__cxx11::basic_string, std::allocator > const&) [clone .localalias] from PrefixStore.cpp:0 2025-12-04T13:44:25.5214431Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #11 c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string, std::allocator > const&, int) from ??:0 2025-12-04T13:44:25.5214744Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #12 c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string, std::allocator > const&, c10::Device&, c10d::OpType, int, bool) from ??:0 2025-12-04T13:44:25.5215178Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #13 c10d::ProcessGroupNCCL::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from ??:0 2025-12-04T13:44:25.5215715Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #14 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from Ops.cpp:0 2025-12-04T13:44:25.5217384Z E1204 13:19:50.440000 60748 
site-packages/torch/testing/_internal/common_distributed.py:935] #15 c10::impl::make_boxed_from_unboxed_functor >, std::allocator > > >, c10::intrusive_ptr > > (*)(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), std::tuple >, std::allocator > > >, c10::intrusive_ptr > >, c10::guts::typelist::typelist >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from :0 2025-12-04T13:44:25.5217898Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #16 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from autograd_not_implemented_fallback.cpp:0 2025-12-04T13:44:25.5219095Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #17 c10::impl::BoxedKernelWrapper >, std::allocator > > >, c10::intrusive_ptr > > (std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from :0 2025-12-04T13:44:25.5219523Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #18 c10d::ProcessGroup::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from :0 2025-12-04T13:44:25.5224091Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #19 pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, 
pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr >, c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr > (*)(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0 2025-12-04T13:44:25.5224327Z E1204 13:19:50.440000 60748 
site-packages/torch/testing/_internal/common_distributed.py:935] #20 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.5224547Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #21 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.5224758Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #22 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5225020Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5225267Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5225496Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5225705Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #26 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5225902Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #27 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5226133Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #28 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5226371Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #29 
_PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5226601Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5226810Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5227007Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5227240Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5227486Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5227756Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5227992Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #36 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5228221Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #37 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5228489Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #38 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5228716Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5228952Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #40 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5229179Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #41 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5229439Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #42 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5229667Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #43 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5229907Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #44 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5230135Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5230370Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5230604Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #47 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5230839Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #48 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5231068Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5231278Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #50 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5231477Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #51 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5231707Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5231944Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #53 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5232176Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5232413Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5232643Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #56 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5232899Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #57 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5233126Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #58 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5233362Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #59 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5233588Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5233846Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #61 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5234074Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #62 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5234273Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #63 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5234503Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #64 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5234741Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #65 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5234970Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #66 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5235206Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #67 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5235434Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #68 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5235656Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #69 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5235870Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #70 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5236082Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #71 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5236290Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #72 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5236529Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #73 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5236756Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #74 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5236999Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #75 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5237210Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5237418Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #77 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5237666Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #78 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5237901Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #79 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5238169Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5238379Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #81 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5238586Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #82 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5238785Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #83 _PyObject_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:305 2025-12-04T13:44:25.5238940Z 
E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #84 dynamo__custom_eval_frame from :0 2025-12-04T13:44:25.5239175Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #85 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5239374Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #86 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5239602Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5239800Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #88 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5240030Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5240230Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #90 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5240458Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #91 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5240657Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #92 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5240885Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #93 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5241098Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #94 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5241334Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #95 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5241545Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #96 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5241781Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #97 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5242009Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #98 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5242229Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #99 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5242461Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #100 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5242664Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #101 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5242894Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #102 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 
2025-12-04T13:44:25.5243095Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #103 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5243326Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #104 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5243567Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #105 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5243797Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #106 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5244034Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #107 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5244264Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5244507Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #109 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5244740Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #110 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5244950Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #111 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5245150Z 
E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #112 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5245381Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #113 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5245637Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #114 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5245867Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #115 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5246105Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #116 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5246336Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #117 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5246596Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #118 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5246830Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #119 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5247070Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #120 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5247301Z E1204 
13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #121 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5247556Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #122 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 2025-12-04T13:44:25.5247772Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #123 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.5247973Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #124 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.5248188Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #125 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.5248413Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #126 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.5248621Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #127 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.5248825Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #128 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.5249028Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #129 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.5249209Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #130 __libc_start_main_impl from 
./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.5249341Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #131 _start from ??:0 2025-12-04T13:44:25.5249493Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #132 from ??:0 2025-12-04T13:44:25.5249712Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] . This may indicate a possible application crash on rank 0 or a network set up issue. 2025-12-04T13:44:25.5250080Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] Exception raised from broadcastUniqueNCCLID at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2815 (most recent call first): 2025-12-04T13:44:25.5250217Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] C++ CapturedTraceback: 2025-12-04T13:44:25.5250709Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5250997Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5251362Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #6 c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string, std::allocator > const&, int) [clone .cold] from ProcessGroupNCCL.cpp:0 2025-12-04T13:44:25.5251678Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #7 c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string, std::allocator > const&, c10::Device&, 
c10d::OpType, int, bool) from ??:0 2025-12-04T13:44:25.5252093Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #8 c10d::ProcessGroupNCCL::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from ??:0 2025-12-04T13:44:25.5252627Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #9 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from Ops.cpp:0 2025-12-04T13:44:25.5254264Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #10 c10::impl::make_boxed_from_unboxed_functor >, std::allocator > > >, c10::intrusive_ptr > > (*)(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), std::tuple >, std::allocator > > >, c10::intrusive_ptr > >, c10::guts::typelist::typelist >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from :0 2025-12-04T13:44:25.5254802Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #11 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from autograd_not_implemented_fallback.cpp:0 2025-12-04T13:44:25.5256049Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #12 c10::impl::BoxedKernelWrapper >, std::allocator > > >, c10::intrusive_ptr > > (std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >, std::allocator > > > const&, c10::ArrayRef, 
c10::intrusive_ptr > const&, bool, long) from :0 2025-12-04T13:44:25.5256503Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #13 c10d::ProcessGroup::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from :0 2025-12-04T13:44:25.5261167Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #14 pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr >, c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, 
char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr > (*)(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0 2025-12-04T13:44:25.5261420Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.5261684Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #16 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.5261920Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #17 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5262176Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #18 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5262439Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5262677Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #20 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5262929Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #21 PyVectorcall_Call from 
/usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5263136Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #22 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5263393Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #23 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5263652Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5263906Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5264158Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #26 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5264365Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #27 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5264618Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #28 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5264904Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #29 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5265137Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 
2025-12-04T13:44:25.5265411Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5265649Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #32 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5265910Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #33 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5266161Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #34 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5266412Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #35 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5266678Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5266924Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #37 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5267177Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5267432Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #39 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5267696Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5267973Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #41 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5268212Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5268478Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5268742Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5268969Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #45 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5269186Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #46 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5269441Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5269735Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #48 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5269973Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5270227Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #50 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5270475Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5270744Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5270985Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #53 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5271249Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #54 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5271494Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5271753Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5272015Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #57 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5272226Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #58 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5272477Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #59 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5272729Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #60 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5272977Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #61 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5273268Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5273504Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5273749Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #64 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5273971Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #65 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5274212Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #66 slot_tp_call from 
/usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5274472Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #67 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5274720Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #68 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5274980Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #69 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5275211Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #70 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5275451Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #71 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5275688Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #72 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5275907Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #73 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5276171Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #74 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5276409Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #75 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 
2025-12-04T13:44:25.5276648Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5276871Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #77 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5277098Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #78 _PyObject_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:305 2025-12-04T13:44:25.5277280Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #79 dynamo__custom_eval_frame from :0 2025-12-04T13:44:25.5277563Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5277793Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #81 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5278062Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #82 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5278290Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #83 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5278528Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #84 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5278752Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #85 do_call_core from 
/usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5279033Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #86 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5279251Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #87 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5279503Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #88 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5279724Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #89 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5279952Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #90 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5280166Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #91 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5280446Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #92 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5280701Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5280910Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #94 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5281159Z E1204 
13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #95 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5281365Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #96 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5281640Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #97 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5281848Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #98 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5282104Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #99 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5282366Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #100 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5282627Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #101 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5282910Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #102 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5283150Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #103 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5283409Z E1204 13:19:50.440000 60748 
site-packages/torch/testing/_internal/common_distributed.py:935] #104 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5283687Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #105 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5283904Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #106 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5284142Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #107 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5284384Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5284649Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #109 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5284904Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #110 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5285162Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #111 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5285432Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #112 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5285682Z E1204 13:19:50.440000 60748 
site-packages/torch/testing/_internal/common_distributed.py:935] #113 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5285939Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #114 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5286190Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #115 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5286432Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #116 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5329073Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #117 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 2025-12-04T13:44:25.5329295Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #118 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.5329502Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #119 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.5329775Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #120 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.5330002Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #121 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.5330212Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] 
#122 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.5330413Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #123 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.5330648Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #124 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.5330834Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #125 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.5330970Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #126 _start from ??:0 2025-12-04T13:44:25.5331125Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] #127 from ??:0 2025-12-04T13:44:25.5331232Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.5331337Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.5331460Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] from user code: 2025-12-04T13:44:25.5331723Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 69, in inner 2025-12-04T13:44:25.5331869Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] return fn(*args, **kwargs) 2025-12-04T13:44:25.5331975Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.5332301Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] Set TORCHDYNAMO_VERBOSE=1 for the 
internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" 2025-12-04T13:44:25.5332405Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.5332511Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.5332697Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.5332955Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_inductor 2025-12-04T13:44:25.5333057Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] 2025-12-04T13:44:25.5333261Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.5333416Z E1204 13:19:50.440000 60748 site-packages/torch/testing/_internal/common_distributed.py:935] exiting process 1 with exit code: 10 2025-12-04T13:44:25.5333584Z [rank2]:[W1204 13:19:50.635868295 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5333588Z 2025-12-04T13:44:25.5333780Z [rank2]:[W1204 13:19:50.635971033 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:43784, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5333960Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5334007Z C++ CapturedTraceback: 2025-12-04T13:44:25.5334387Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5334534Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5334674Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5334932Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5335011Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5335081Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5335139Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5335217Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5335220Z 2025-12-04T13:44:25.5335458Z [rank2]:[W1204 13:19:50.636091950 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5335626Z [rank3]:[W1204 13:19:51.964605154 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.5335630Z 2025-12-04T13:44:25.5335800Z [rank3]:[W1204 13:19:51.964663573 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:41662, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5335976Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5336021Z C++ CapturedTraceback: 2025-12-04T13:44:25.5336388Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5336537Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5336658Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5336910Z #7 c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) from ??:0 2025-12-04T13:44:25.5336987Z #8 c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() from ??:0 2025-12-04T13:44:25.5337054Z #9 std::error_code::default_error_condition() const from ??:0 2025-12-04T13:44:25.5337111Z #10 start_thread from ./nptl/pthread_create.c:442 2025-12-04T13:44:25.5337185Z #11 __clone3 from ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 2025-12-04T13:44:25.5337188Z 2025-12-04T13:44:25.5337423Z [rank3]:[W1204 13:19:51.965006325 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5337686Z [rank1]:[W1204 13:19:51.019674129 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5337861Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent 
call first): 2025-12-04T13:44:25.5338121Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5338288Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5338691Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5338894Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5339005Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5339104Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5339204Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5339206Z 2025-12-04T13:44:25.5339442Z [rank1]:[W1204 13:19:51.022071566 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5339616Z [rank2]:[W1204 13:19:51.354175520 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5339791Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5340046Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:25.5340212Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5340578Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5340783Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5340890Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5340988Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5341085Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5341089Z 2025-12-04T13:44:25.5341323Z [rank2]:[W1204 13:19:51.355824343 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5341497Z [rank3]:[W1204 13:19:51.361284050 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5341688Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5341945Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5342106Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5342472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5342696Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5342802Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5342899Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5342995Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5342997Z 2025-12-04T13:44:25.5343231Z [rank3]:[W1204 13:19:51.363840563 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5343275Z FAILED [38.7613s] [100%] 2025-12-04T13:44:25.5343277Z 2025-12-04T13:44:25.5343336Z =================================== FAILURES =================================== 2025-12-04T13:44:25.5343416Z _______________________ TestMultiProc.test_fsdp_inductor _______________________ 2025-12-04T13:44:25.5343465Z Traceback (most recent call last): 2025-12-04T13:44:25.5343632Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T13:44:25.5343678Z self._join_processes(fn) 2025-12-04T13:44:25.5343850Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T13:44:25.5343907Z self._check_return_codes(fn, elapsed_time) 2025-12-04T13:44:25.5344082Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1079, in _check_return_codes 2025-12-04T13:44:25.5344130Z raise RuntimeError(error) 2025-12-04T13:44:25.5344210Z 
RuntimeError: Process 1 exited with error code 10 and exception: 2025-12-04T13:44:25.5344255Z Traceback (most recent call last): 2025-12-04T13:44:25.5344415Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.5344458Z getattr(self, test_name)() 2025-12-04T13:44:25.5344615Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.5344650Z fn() 2025-12-04T13:44:25.5344801Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.5344843Z method(*args, **kwargs) 2025-12-04T13:44:25.5344932Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.5344973Z return func(*args, **kwds) 2025-12-04T13:44:25.5345133Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.5345175Z return func(*args, **kwargs) 2025-12-04T13:44:25.5345346Z File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 808, in test_fsdp_inductor 2025-12-04T13:44:25.5345389Z outputs = fsdp_m(inputs) 2025-12-04T13:44:25.5345523Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441, in __call__ 2025-12-04T13:44:25.5345572Z return super().__call__(*args, **kwargs) 2025-12-04T13:44:25.5345723Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl 2025-12-04T13:44:25.5345772Z return self._call_impl(*args, **kwargs) 2025-12-04T13:44:25.5345909Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl 2025-12-04T13:44:25.5345955Z return forward_call(*args, **kwargs) 2025-12-04T13:44:25.5346118Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926, in compile_wrapper 2025-12-04T13:44:25.5346159Z return fn(*args, **kwargs) 2025-12-04T13:44:25.5346298Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 2194, in __call__ 2025-12-04T13:44:25.5346349Z result = self._torchdynamo_orig_backend( 2025-12-04T13:44:25.5346486Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1937, in __call__ 2025-12-04T13:44:25.5346530Z result = self._inner_convert( 2025-12-04T13:44:25.5346666Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 706, in __call__ 2025-12-04T13:44:25.5346705Z result = _compile( 2025-12-04T13:44:25.5346840Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1807, in _compile 2025-12-04T13:44:25.5346892Z raise InternalTorchDynamoError( 2025-12-04T13:44:25.5347025Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1744, in _compile 2025-12-04T13:44:25.5347111Z guarded_code, tracer_output = compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.5347246Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_utils_internal.py", line 97, in wrapper_function 2025-12-04T13:44:25.5347293Z return function(*args, **kwargs) 2025-12-04T13:44:25.5347438Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1425, in compile_inner 2025-12-04T13:44:25.5347542Z return _compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.5347685Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1459, in _compile_inner 2025-12-04T13:44:25.5347732Z dynamo_output = compile_frame( 2025-12-04T13:44:25.5347874Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1341, in 
compile_frame 2025-12-04T13:44:25.5347957Z bytecode, tracer_output = transform_code_object(code, transform) 2025-12-04T13:44:25.5348131Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1600, in transform_code_object 2025-12-04T13:44:25.5348207Z tracer_output = transformations(instructions, code_options) 2025-12-04T13:44:25.5348343Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1313, in transform 2025-12-04T13:44:25.5348387Z tracer_output = trace_frame( 2025-12-04T13:44:25.5348517Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 328, in _fn 2025-12-04T13:44:25.5348560Z return fn(*args, **kwargs) 2025-12-04T13:44:25.5348697Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 837, in trace_frame 2025-12-04T13:44:25.5348737Z run_tracer() 2025-12-04T13:44:25.5348872Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 818, in run_tracer 2025-12-04T13:44:25.5348910Z tracer.run() 2025-12-04T13:44:25.5349073Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1639, in run 2025-12-04T13:44:25.5349114Z while self.step(): 2025-12-04T13:44:25.5349248Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1319, in step 2025-12-04T13:44:25.5349304Z self.dispatch_table[inst.opcode](self, inst) 2025-12-04T13:44:25.5349442Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 856, in wrapper 2025-12-04T13:44:25.5349512Z return handle_graph_break(self, inst, speculation.reason) 2025-12-04T13:44:25.5349672Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 918, in handle_graph_break 2025-12-04T13:44:25.5351831Z all_stack_locals_metadata = 
self.output.compile_subgraph( 2025-12-04T13:44:25.5351982Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1712, in compile_subgraph 2025-12-04T13:44:25.5352028Z self.run_compiler_collective() 2025-12-04T13:44:25.5352186Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 2069, in run_compiler_collective 2025-12-04T13:44:25.5352269Z dist.all_gather_object(all_states, ds.local_state, group=compile_pg) 2025-12-04T13:44:25.5352409Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.5352451Z return func(*args, **kwargs) 2025-12-04T13:44:25.5352617Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3248, in all_gather_object 2025-12-04T13:44:25.5352681Z all_gather(object_size_list, local_size, group=group) 2025-12-04T13:44:25.5352822Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.5352863Z return func(*args, **kwargs) 2025-12-04T13:44:25.5353018Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4015, in all_gather 2025-12-04T13:44:25.5353083Z work = group.allgather([tensor_list], [tensor], opts) 2025-12-04T13:44:25.5353356Z torch._dynamo.exc.InternalTorchDynamoError: DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe 2025-12-04T13:44:25.5353529Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5353569Z C++ CapturedTraceback: 2025-12-04T13:44:25.5353942Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > 
()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5354093Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5354213Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5354410Z #7 c10d::TCPStore::doWait(c10::ArrayRef, std::allocator > >, std::chrono::duration >) from ??:0 2025-12-04T13:44:25.5354541Z #8 c10d::TCPStore::doGet(std::__cxx11::basic_string, std::allocator > const&) from ??:0 2025-12-04T13:44:25.5354668Z #9 c10d::TCPStore::get(std::__cxx11::basic_string, std::allocator > const&) from ??:0 2025-12-04T13:44:25.5354841Z #10 c10d::PrefixStore::get(std::__cxx11::basic_string, std::allocator > const&) [clone .localalias] from PrefixStore.cpp:0 2025-12-04T13:44:25.5355072Z #11 c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string, std::allocator > const&, int) from ??:0 2025-12-04T13:44:25.5355266Z #12 c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string, std::allocator > const&, c10::Device&, c10d::OpType, int, bool) from ??:0 2025-12-04T13:44:25.5355561Z #13 c10d::ProcessGroupNCCL::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from ??:0 2025-12-04T13:44:25.5355978Z #14 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from Ops.cpp:0 2025-12-04T13:44:25.5357575Z #15 c10::impl::make_boxed_from_unboxed_functor >, std::allocator > > >, c10::intrusive_ptr > > (*)(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), std::tuple >, std::allocator > > >, c10::intrusive_ptr > >, c10::guts::typelist::typelist >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, 
c10::DispatchKeySet, std::vector >*) from :0 2025-12-04T13:44:25.5357929Z #16 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from autograd_not_implemented_fallback.cpp:0 2025-12-04T13:44:25.5359015Z #17 c10::impl::BoxedKernelWrapper >, std::allocator > > >, c10::intrusive_ptr > > (std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from :0 2025-12-04T13:44:25.5359297Z #18 c10d::ProcessGroup::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from :0 2025-12-04T13:44:25.5363745Z #19 pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr >, c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, 
std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr > (*)(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0 2025-12-04T13:44:25.5363867Z #20 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.5363988Z #21 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.5364087Z #22 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5364211Z #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5364335Z #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5364450Z #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5364543Z #26 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5364648Z #27 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 
2025-12-04T13:44:25.5364762Z #28 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5364884Z #29 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5364997Z #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5365089Z #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5365170Z #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5365283Z #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5365401Z #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5365512Z #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5365633Z #36 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5365747Z #37 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5365867Z #38 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5365978Z #39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5366097Z #40 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5366208Z #41 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5366327Z #42 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5366442Z #43 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5366563Z #44 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5366676Z #45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5366794Z #46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5366905Z #47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5367024Z #48 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5367134Z #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5367226Z #50 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5367307Z #51 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5367422Z #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5367604Z #53 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5367716Z #54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5367834Z #55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5367944Z #56 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5368062Z #57 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5368173Z #58 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5368291Z 
#59 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5368436Z #60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5368557Z #61 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5368669Z #62 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5368751Z #63 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5368863Z #64 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5368983Z #65 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5369093Z #66 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5369214Z #67 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5369328Z #68 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5369437Z #69 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5369534Z #70 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5369631Z #71 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5369725Z #72 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5369849Z #73 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5369959Z #74 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5370067Z #75 
_PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5370164Z #76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5370260Z #77 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5370357Z #78 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5370480Z #79 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5370590Z #80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5370685Z #81 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5370779Z #82 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5370862Z #83 _PyObject_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:305 2025-12-04T13:44:25.5370912Z #84 dynamo__custom_eval_frame from :0 2025-12-04T13:44:25.5371023Z #85 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5371109Z #86 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5371240Z #87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5371324Z #88 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5371435Z #89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5371517Z #90 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5371628Z #91 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5371710Z #92 do_call_core from 
/usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5371820Z #93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5371939Z #94 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5372030Z #95 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5372125Z #96 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5372246Z #97 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5372359Z #98 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5372442Z #99 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5372560Z #100 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5372645Z #101 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5372761Z #102 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5372846Z #103 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5372961Z #104 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5373086Z #105 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5373198Z #106 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5373323Z #107 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5373434Z #108 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5373557Z #109 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5373668Z #110 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5373764Z #111 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5373847Z #112 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5373964Z #113 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5374086Z #114 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5374200Z #115 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5374322Z #116 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5374437Z #117 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5374558Z #118 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5374674Z #119 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5374815Z #120 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5374928Z #121 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5375020Z #122 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 2025-12-04T13:44:25.5375116Z #123 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.5375201Z #124 run_mod from 
/usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.5375301Z #125 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.5375411Z #126 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.5375503Z #127 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.5375606Z #128 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.5375694Z #129 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.5375763Z #130 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.5375803Z #131 _start from ??:0 2025-12-04T13:44:25.5375851Z #132 from ??:0 2025-12-04T13:44:25.5375955Z . This may indicate a possible application crash on rank 0 or a network set up issue. 2025-12-04T13:44:25.5376181Z Exception raised from broadcastUniqueNCCLID at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2815 (most recent call first): 2025-12-04T13:44:25.5376223Z C++ CapturedTraceback: 2025-12-04T13:44:25.5376598Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5376748Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5376993Z #6 c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string, std::allocator > const&, int) [clone .cold] from ProcessGroupNCCL.cpp:0 2025-12-04T13:44:25.5377190Z #7 c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string, std::allocator > const&, c10::Device&, c10d::OpType, int, bool) from ??:0 2025-12-04T13:44:25.5377518Z #8 c10d::ProcessGroupNCCL::allgather(std::vector >, std::allocator > > >&, std::vector 
>&, c10d::AllgatherOptions const&) from ??:0 2025-12-04T13:44:25.5377935Z #9 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from Ops.cpp:0 2025-12-04T13:44:25.5379492Z #10 c10::impl::make_boxed_from_unboxed_functor >, std::allocator > > >, c10::intrusive_ptr > > (*)(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), std::tuple >, std::allocator > > >, c10::intrusive_ptr > >, c10::guts::typelist::typelist >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from :0 2025-12-04T13:44:25.5379845Z #11 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from autograd_not_implemented_fallback.cpp:0 2025-12-04T13:44:25.5380944Z #12 c10::impl::BoxedKernelWrapper >, std::allocator > > >, c10::intrusive_ptr > > (std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from :0 2025-12-04T13:44:25.5381229Z #13 c10d::ProcessGroup::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from :0 2025-12-04T13:44:25.5385680Z #14 pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector 
>&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr >, c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr > (*)(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0 2025-12-04T13:44:25.5385798Z #15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.5385901Z #16 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 
2025-12-04T13:44:25.5385998Z #17 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5386123Z #18 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5386244Z #19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5386358Z #20 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5386451Z #21 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5386534Z #22 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5386647Z #23 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5386769Z #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5386881Z #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5386972Z #26 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5387055Z #27 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5387165Z #28 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5387287Z #29 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5387398Z #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5387593Z #31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5387704Z #32 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 
2025-12-04T13:44:25.5387826Z #33 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5387937Z #34 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5388059Z #35 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5388169Z #36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5388290Z #37 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5388424Z #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5388548Z #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5388659Z #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5388780Z #41 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5388892Z #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5389011Z #43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5389123Z #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5389215Z #45 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5389298Z #46 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5389412Z #47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5389532Z #48 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5389643Z #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5389764Z #50 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5389874Z #51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5389994Z #52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5390104Z #53 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5390227Z #54 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5390338Z #55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5390459Z #56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5390571Z #57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5390652Z #58 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5390765Z #59 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5390884Z #60 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5390995Z #61 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5391116Z #62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5391246Z #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 
2025-12-04T13:44:25.5391354Z #64 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5391451Z #65 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5391544Z #66 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5391638Z #67 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5391758Z #68 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5391870Z #69 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5391996Z #70 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5392093Z #71 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5392185Z #72 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5392280Z #73 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5392400Z #74 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5392513Z #75 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5392626Z #76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5392716Z #77 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5392801Z #78 _PyObject_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:305 2025-12-04T13:44:25.5392849Z #79 dynamo__custom_eval_frame from :0 2025-12-04T13:44:25.5392962Z #80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 
2025-12-04T13:44:25.5393045Z #81 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5393158Z #82 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5393240Z #83 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5393352Z #84 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5393433Z #85 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5393546Z #86 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5393627Z #87 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5393740Z #88 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5393836Z #89 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5393931Z #90 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5394024Z #91 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5394146Z #92 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5394256Z #93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5394339Z #94 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5394450Z #95 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5394532Z #96 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5394646Z #97 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5394745Z #98 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5394856Z #99 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5394981Z #100 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5395097Z #101 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5395220Z #102 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5395337Z #103 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5395458Z #104 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5395594Z #105 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5395688Z #106 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5395774Z #107 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5395887Z #108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5396010Z #109 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5396123Z #110 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5396246Z #111 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5396359Z #112 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5396483Z #113 
_PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5396598Z #114 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5396720Z #115 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5396835Z #116 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5396927Z #117 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 2025-12-04T13:44:25.5397025Z #118 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.5397109Z #119 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.5397210Z #120 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.5397321Z #121 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.5397416Z #122 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.5397527Z #123 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.5397616Z #124 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.5397683Z #125 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.5397724Z #126 _start from ??:0 2025-12-04T13:44:25.5397770Z #127 from ??:0 2025-12-04T13:44:25.5397772Z 2025-12-04T13:44:25.5397775Z 2025-12-04T13:44:25.5397813Z from user code: 2025-12-04T13:44:25.5397955Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 69, in inner 2025-12-04T13:44:25.5397999Z return fn(*args, **kwargs) 2025-12-04T13:44:25.5398002Z 2025-12-04T13:44:25.5398212Z Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please 
do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" 2025-12-04T13:44:25.5398214Z 2025-12-04T13:44:25.5398216Z 2025-12-04T13:44:25.5398324Z To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.5398464Z PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_inductor 2025-12-04T13:44:25.5398466Z 2025-12-04T13:44:25.5398553Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.5398555Z 2025-12-04T13:44:25.5398557Z 2025-12-04T13:44:25.5398637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:44:25.5398724Z Process 1 terminated with exit code 10, terminating remaining processes. 2025-12-04T13:44:25.5398974Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-002c6ad5ba4d16ec.xml - 2025-12-04T13:44:25.5399063Z =========================== short test summary info ============================ 2025-12-04T13:44:25.5399265Z FAILED [38.7613s] distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor - RuntimeError: Process 1 exited with error code 10 and exception: 2025-12-04T13:44:25.5399313Z Traceback (most recent call last): 2025-12-04T13:44:25.5399479Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925, in run_test 2025-12-04T13:44:25.5399523Z getattr(self, test_name)() 2025-12-04T13:44:25.5399685Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772, in wrapper 2025-12-04T13:44:25.5399722Z fn() 2025-12-04T13:44:25.5399873Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T13:44:25.5399916Z method(*args, **kwargs) 
2025-12-04T13:44:25.5400007Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T13:44:25.5400050Z return func(*args, **kwds) 2025-12-04T13:44:25.5400208Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227, in wrapper 2025-12-04T13:44:25.5400252Z return func(*args, **kwargs) 2025-12-04T13:44:25.5400395Z File "/var/lib/jenkins/pytorch/test/distributed/test_dynamo_distributed.py", line 808, in test_fsdp_inductor 2025-12-04T13:44:25.5400440Z outputs = fsdp_m(inputs) 2025-12-04T13:44:25.5400574Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441, in __call__ 2025-12-04T13:44:25.5400625Z return super().__call__(*args, **kwargs) 2025-12-04T13:44:25.5400776Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl 2025-12-04T13:44:25.5400826Z return self._call_impl(*args, **kwargs) 2025-12-04T13:44:25.5400965Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl 2025-12-04T13:44:25.5401012Z return forward_call(*args, **kwargs) 2025-12-04T13:44:25.5401156Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926, in compile_wrapper 2025-12-04T13:44:25.5401199Z return fn(*args, **kwargs) 2025-12-04T13:44:25.5401335Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 2194, in __call__ 2025-12-04T13:44:25.5401387Z result = self._torchdynamo_orig_backend( 2025-12-04T13:44:25.5401523Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1937, in __call__ 2025-12-04T13:44:25.5401568Z result = self._inner_convert( 2025-12-04T13:44:25.5401702Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 706, in __call__ 2025-12-04T13:44:25.5401743Z 
result = _compile( 2025-12-04T13:44:25.5401881Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1807, in _compile 2025-12-04T13:44:25.5401931Z raise InternalTorchDynamoError( 2025-12-04T13:44:25.5402087Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1744, in _compile 2025-12-04T13:44:25.5402170Z guarded_code, tracer_output = compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.5402306Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_utils_internal.py", line 97, in wrapper_function 2025-12-04T13:44:25.5402349Z return function(*args, **kwargs) 2025-12-04T13:44:25.5402495Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1425, in compile_inner 2025-12-04T13:44:25.5402552Z return _compile_inner(code, one_graph, hooks) 2025-12-04T13:44:25.5402697Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1459, in _compile_inner 2025-12-04T13:44:25.5402762Z dynamo_output = compile_frame( 2025-12-04T13:44:25.5402908Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1341, in compile_frame 2025-12-04T13:44:25.5402987Z bytecode, tracer_output = transform_code_object(code, transform) 2025-12-04T13:44:25.5403162Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1600, in transform_code_object 2025-12-04T13:44:25.5403236Z tracer_output = transformations(instructions, code_options) 2025-12-04T13:44:25.5403377Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1313, in transform 2025-12-04T13:44:25.5403420Z tracer_output = trace_frame( 2025-12-04T13:44:25.5403549Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 328, in _fn 2025-12-04T13:44:25.5403592Z return fn(*args, 
**kwargs) 2025-12-04T13:44:25.5403731Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 837, in trace_frame 2025-12-04T13:44:25.5403768Z run_tracer() 2025-12-04T13:44:25.5403907Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 818, in run_tracer 2025-12-04T13:44:25.5403943Z tracer.run() 2025-12-04T13:44:25.5404078Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1639, in run 2025-12-04T13:44:25.5404119Z while self.step(): 2025-12-04T13:44:25.5404254Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1319, in step 2025-12-04T13:44:25.5404312Z self.dispatch_table[inst.opcode](self, inst) 2025-12-04T13:44:25.5404449Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 856, in wrapper 2025-12-04T13:44:25.5404524Z return handle_graph_break(self, inst, speculation.reason) 2025-12-04T13:44:25.5404679Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 918, in handle_graph_break 2025-12-04T13:44:25.5404753Z all_stack_locals_metadata = self.output.compile_subgraph( 2025-12-04T13:44:25.5404900Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1712, in compile_subgraph 2025-12-04T13:44:25.5404947Z self.run_compiler_collective() 2025-12-04T13:44:25.5405101Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 2069, in run_compiler_collective 2025-12-04T13:44:25.5405183Z dist.all_gather_object(all_states, ds.local_state, group=compile_pg) 2025-12-04T13:44:25.5405322Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.5405364Z return func(*args, **kwargs) 2025-12-04T13:44:25.5405531Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3248, in all_gather_object 2025-12-04T13:44:25.5405598Z all_gather(object_size_list, local_size, group=group) 2025-12-04T13:44:25.5405754Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper 2025-12-04T13:44:25.5405799Z return func(*args, **kwargs) 2025-12-04T13:44:25.5405954Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4015, in all_gather 2025-12-04T13:44:25.5406020Z work = group.allgather([tensor_list], [tensor], opts) 2025-12-04T13:44:25.5406294Z torch._dynamo.exc.InternalTorchDynamoError: DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe 2025-12-04T13:44:25.5406470Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5406537Z C++ CapturedTraceback: 2025-12-04T13:44:25.5406909Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5407056Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5407174Z #6 void c10d::tcputil::sendBytes(int, unsigned char const*, unsigned long, bool) from :0 2025-12-04T13:44:25.5407374Z #7 c10d::TCPStore::doWait(c10::ArrayRef, std::allocator > >, std::chrono::duration >) from ??:0 2025-12-04T13:44:25.5407614Z #8 c10d::TCPStore::doGet(std::__cxx11::basic_string, std::allocator > const&) from ??:0 2025-12-04T13:44:25.5407744Z #9 c10d::TCPStore::get(std::__cxx11::basic_string, std::allocator > const&) from ??:0 2025-12-04T13:44:25.5407916Z #10 
c10d::PrefixStore::get(std::__cxx11::basic_string, std::allocator > const&) [clone .localalias] from PrefixStore.cpp:0 2025-12-04T13:44:25.5408121Z #11 c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string, std::allocator > const&, int) from ??:0 2025-12-04T13:44:25.5408318Z #12 c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string, std::allocator > const&, c10::Device&, c10d::OpType, int, bool) from ??:0 2025-12-04T13:44:25.5408614Z #13 c10d::ProcessGroupNCCL::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from ??:0 2025-12-04T13:44:25.5409034Z #14 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from Ops.cpp:0 2025-12-04T13:44:25.5410586Z #15 c10::impl::make_boxed_from_unboxed_functor >, std::allocator > > >, c10::intrusive_ptr > > (*)(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), std::tuple >, std::allocator > > >, c10::intrusive_ptr > >, c10::guts::typelist::typelist >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from :0 2025-12-04T13:44:25.5410940Z #16 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from autograd_not_implemented_fallback.cpp:0 2025-12-04T13:44:25.5412045Z #17 c10::impl::BoxedKernelWrapper >, std::allocator > > >, c10::intrusive_ptr > > (std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, 
long) from :0 2025-12-04T13:44:25.5412331Z #18 c10d::ProcessGroup::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from :0 2025-12-04T13:44:25.5416812Z #19 pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr >, c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr > (*)(c10d::ProcessGroup*, std::vector >, std::allocator > 
> >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0 2025-12-04T13:44:25.5416928Z #20 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.5417035Z #21 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.5417131Z #22 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5417256Z #23 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5417379Z #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5417528Z #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5417622Z #26 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5417707Z #27 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5417821Z #28 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5417944Z #29 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5418055Z #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5418148Z #31 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5418230Z #32 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5418340Z #33 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5418464Z #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5418574Z #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5418722Z #36 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5418834Z #37 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5418955Z #38 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5419065Z #39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5419186Z #40 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5419298Z #41 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5419444Z #42 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5419554Z #43 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5419676Z #44 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5419790Z #45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5419911Z #46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5420022Z #47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5420141Z #48 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5420253Z #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5420346Z #50 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5420430Z #51 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5420541Z #52 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5420661Z #53 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5420771Z #54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5420891Z #55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5421001Z #56 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5421123Z #57 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5421235Z #58 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5421355Z #59 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5421466Z #60 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5421588Z #61 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5421698Z #62 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5421782Z #63 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5421895Z #64 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5422014Z #65 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5422127Z #66 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5422246Z #67 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5422386Z #68 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5422494Z #69 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5422592Z #70 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5422687Z #71 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5422781Z #72 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5422901Z #73 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5423014Z #74 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5423143Z #75 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5423241Z #76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5423332Z #77 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5423426Z #78 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5423545Z #79 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5423659Z #80 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5423754Z #81 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5423846Z #82 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5423933Z #83 _PyObject_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:305 2025-12-04T13:44:25.5423979Z #84 dynamo__custom_eval_frame from :0 2025-12-04T13:44:25.5424094Z #85 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5424177Z #86 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5424290Z #87 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5424373Z #88 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5424484Z #89 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5424565Z #90 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5424678Z #91 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5424761Z #92 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5424873Z #93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5424969Z #94 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5425061Z #95 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5425154Z #96 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5425276Z #97 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5425387Z #98 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5425470Z #99 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5425585Z #100 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5425671Z #101 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5425787Z #102 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5425888Z #103 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5426006Z #104 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5426131Z #105 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5426245Z #106 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5426369Z #107 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5426483Z #108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5426631Z #109 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5426745Z #110 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5426839Z #111 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5426925Z #112 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5427039Z #113 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5427165Z #114 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5427277Z #115 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5427399Z #116 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5427547Z #117 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5427671Z #118 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5427786Z #119 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5427908Z #120 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5428022Z #121 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5428114Z #122 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 2025-12-04T13:44:25.5428213Z #123 run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.5428299Z #124 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.5428399Z #125 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.5428509Z #126 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.5428605Z #127 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.5428689Z #128 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.5428777Z #129 __libc_start_call_main from 
./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.5428845Z #130 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.5428887Z #131 _start from ??:0 2025-12-04T13:44:25.5428933Z #132 from ??:0 2025-12-04T13:44:25.5429035Z . This may indicate a possible application crash on rank 0 or a network set up issue. 2025-12-04T13:44:25.5429259Z Exception raised from broadcastUniqueNCCLID at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2815 (most recent call first): 2025-12-04T13:44:25.5429304Z C++ CapturedTraceback: 2025-12-04T13:44:25.5429701Z #4 std::_Function_handler, std::allocator > > const> (), c10::SetStackTraceFetcher(std::function, std::allocator > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 2025-12-04T13:44:25.5429849Z #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) from ??:0 2025-12-04T13:44:25.5430094Z #6 c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string, std::allocator > const&, int) [clone .cold] from ProcessGroupNCCL.cpp:0 2025-12-04T13:44:25.5430288Z #7 c10d::ProcessGroupNCCL::initNCCLComm(std::__cxx11::basic_string, std::allocator > const&, c10::Device&, c10d::OpType, int, bool) from ??:0 2025-12-04T13:44:25.5430611Z #8 c10d::ProcessGroupNCCL::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from ??:0 2025-12-04T13:44:25.5431023Z #9 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from Ops.cpp:0 2025-12-04T13:44:25.5432541Z #10 c10::impl::make_boxed_from_unboxed_functor >, std::allocator > > >, c10::intrusive_ptr > > (*)(std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), std::tuple >, std::allocator > > >, c10::intrusive_ptr > >, c10::guts::typelist::typelist >, std::allocator > > > 
const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from :0 2025-12-04T13:44:25.5432896Z #11 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >*) from autograd_not_implemented_fallback.cpp:0 2025-12-04T13:44:25.5433997Z #12 c10::impl::BoxedKernelWrapper >, std::allocator > > >, c10::intrusive_ptr > > (std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector >, std::allocator > > > const&, c10::ArrayRef, c10::intrusive_ptr > const&, bool, long) from :0 2025-12-04T13:44:25.5434278Z #13 c10d::ProcessGroup::allgather(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&) from :0 2025-12-04T13:44:25.5438758Z #14 pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr >, c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, 
pybind11::call_guard, char [148]>(pybind11::cpp_function::initialize >, c10d::ProcessGroup, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard, char [148]>(c10::intrusive_ptr > (c10d::ProcessGroup::*)(std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr > (*)(c10d::ProcessGroup*, std::vector >, std::allocator > > >&, std::vector >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard const&, char const (&) [148])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0 2025-12-04T13:44:25.5438874Z #15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0 2025-12-04T13:44:25.5438975Z #16 cfunction_call from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:543 2025-12-04T13:44:25.5439071Z #17 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5439195Z #18 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5439342Z #19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5439454Z #20 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5439549Z #21 PyVectorcall_Call from 
/usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5439631Z #22 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5439744Z #23 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5439867Z #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5439977Z #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5440068Z #26 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5440149Z #27 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5440263Z #28 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5440385Z #29 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5440497Z #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5440616Z #31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5440727Z #32 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5440846Z #33 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5440958Z #34 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5441077Z #35 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5441192Z #36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5441315Z #37 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5441427Z #38 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5441546Z #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5441659Z #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5441779Z #41 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5441889Z #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5442008Z #43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5442120Z #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5442232Z #45 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5442314Z #46 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5442426Z #47 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5442545Z #48 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5442657Z #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5442776Z #50 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5442888Z #51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5443034Z #52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5443150Z #53 
_PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5443269Z #54 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5443383Z #55 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5443505Z #56 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5443617Z #57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5443700Z #58 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5443810Z #59 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5443933Z #60 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5444042Z #61 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5444164Z #62 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5444274Z #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5444383Z #64 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5444477Z #65 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5444571Z #66 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5444664Z #67 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5444788Z #68 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5444898Z #69 
_PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5445006Z #70 _PyObject_FastCallDictTstate from /usr/local/src/conda/python-3.10.14/Objects/call.c:153 2025-12-04T13:44:25.5445101Z #71 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5445194Z #72 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5445285Z #73 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5445408Z #74 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5445521Z #75 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5445616Z #76 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5445712Z #77 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5445796Z #78 _PyObject_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:305 2025-12-04T13:44:25.5445863Z #79 dynamo__custom_eval_frame from :0 2025-12-04T13:44:25.5445974Z #80 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5446057Z #81 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5446166Z #82 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5446249Z #83 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5446360Z #84 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5446443Z #85 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5446588Z #86 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5446670Z #87 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5446781Z #88 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5446878Z #89 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.14/Objects/call.c:431 2025-12-04T13:44:25.5446968Z #90 slot_tp_call from /usr/local/src/conda/python-3.10.14/Objects/typeobject.c:7494 2025-12-04T13:44:25.5447062Z #91 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.14/Objects/call.c:215 2025-12-04T13:44:25.5447182Z #92 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:112 2025-12-04T13:44:25.5447293Z #93 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5447374Z #94 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5447534Z #95 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5447616Z #96 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5447731Z #97 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5447814Z #98 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5447924Z #99 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5448049Z #100 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5448165Z #101 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5448288Z #102 _PyObject_VectorcallTstate from 
/usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5448404Z #103 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5448528Z #104 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5448642Z #105 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5448737Z #106 PyVectorcall_Call from /usr/local/src/conda/python-3.10.14/Objects/call.c:267 2025-12-04T13:44:25.5448820Z #107 do_call_core from /usr/local/src/conda/python-3.10.14/Python/ceval.c:5945 2025-12-04T13:44:25.5448935Z #108 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5449057Z #109 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5449170Z #110 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5449292Z #111 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5449407Z #112 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5449555Z #113 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5449668Z #114 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5449791Z #115 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 2025-12-04T13:44:25.5449904Z #116 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 2025-12-04T13:44:25.5449998Z #117 PyEval_EvalCode from /usr/local/src/conda/python-3.10.14/Python/ceval.c:1134 2025-12-04T13:44:25.5450093Z #118 
run_eval_code_obj from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1291 2025-12-04T13:44:25.5450207Z #119 run_mod from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1312 2025-12-04T13:44:25.5450307Z #120 PyRun_StringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:1183 2025-12-04T13:44:25.5450417Z #121 PyRun_SimpleStringFlags from /usr/local/src/conda/python-3.10.14/Python/pythonrun.c:503 2025-12-04T13:44:25.5450509Z #122 pymain_run_command from /usr/local/src/conda/python-3.10.14/Modules/main.c:252 2025-12-04T13:44:25.5450595Z #123 Py_BytesMain from /usr/local/src/conda/python-3.10.14/Modules/main.c:1090 2025-12-04T13:44:25.5450680Z #124 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58 2025-12-04T13:44:25.5450749Z #125 __libc_start_main_impl from ./csu/../csu/libc-start.c:392 2025-12-04T13:44:25.5450787Z #126 _start from ??:0 2025-12-04T13:44:25.5450837Z #127 from ??:0 2025-12-04T13:44:25.5450839Z 2025-12-04T13:44:25.5450841Z 2025-12-04T13:44:25.5450878Z from user code: 2025-12-04T13:44:25.5451020Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 69, in inner 2025-12-04T13:44:25.5451064Z return fn(*args, **kwargs) 2025-12-04T13:44:25.5451070Z 2025-12-04T13:44:25.5451279Z Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" 2025-12-04T13:44:25.5451282Z 2025-12-04T13:44:25.5451283Z 2025-12-04T13:44:25.5451361Z To execute this test, run the following from the base repo dir: 2025-12-04T13:44:25.5451500Z PYTORCH_TEST_WITH_ROCM=1 python test/distributed/test_dynamo_distributed.py TestMultiProc.test_fsdp_inductor 2025-12-04T13:44:25.5451502Z 2025-12-04T13:44:25.5451590Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T13:44:25.5451652Z !!!!!!!!!!!!!!!!!!!!!!!!!! 
stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T13:44:25.5451717Z ====================== 1 failed, 61 deselected in 38.77s ======================= 2025-12-04T13:44:25.5451895Z [rank1]:[W1204 13:19:52.022176872 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5452076Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5452338Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5452505Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5452874Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5453096Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5453208Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5453306Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5453408Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5453410Z 2025-12-04T13:44:25.5453650Z [rank1]:[W1204 13:19:52.024186037 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5453691Z Got exit code 1 2025-12-04T13:44:25.5453757Z Retrying single 
test... 2025-12-04T13:44:25.5453929Z [rank2]:[W1204 13:19:52.355997168 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5454107Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5454363Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5454527Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5454893Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5455098Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5455204Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5455303Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5455400Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5455405Z 2025-12-04T13:44:25.5455638Z [rank2]:[W1204 13:19:52.357721760 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5455813Z [rank3]:[W1204 13:19:52.363991809 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5455989Z Exception raised from sendBytes 
at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5456243Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5456403Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5456767Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5456995Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5457101Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5457197Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5457293Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5457295Z 2025-12-04T13:44:25.5457569Z [rank3]:[W1204 13:19:52.365651792 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5457737Z [rank1]:[W1204 13:19:53.024676417 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5457941Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5458197Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5458362Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5458725Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5458928Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5459034Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5459129Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5459226Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5459228Z 2025-12-04T13:44:25.5459460Z [rank1]:[W1204 13:19:53.026832298 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5459629Z [rank2]:[W1204 13:19:53.357890277 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5459807Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5460062Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5460226Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5460591Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5460795Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5460901Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5461021Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5461121Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5461122Z 2025-12-04T13:44:25.5461353Z [rank2]:[W1204 13:19:53.360118547 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5461523Z [rank3]:[W1204 13:19:53.365806029 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5461695Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5461973Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5462134Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5462501Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5462703Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5462808Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5462906Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5463000Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5463002Z 2025-12-04T13:44:25.5463233Z [rank3]:[W1204 13:19:53.367981800 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5463401Z [rank1]:[W1204 13:19:54.026961737 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5463574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5463831Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5463994Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5464358Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5464558Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5464665Z frame #4: + 0xdc253 (0x7e1038e48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5464759Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5464873Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5464875Z 2025-12-04T13:44:25.5465106Z [rank1]:[W1204 13:19:54.029415862 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5465275Z [rank2]:[W1204 13:19:54.360254625 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5465449Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5465722Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5465886Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5466249Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5466451Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5466555Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5466654Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5466754Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.5466756Z 2025-12-04T13:44:25.5466988Z [rank2]:[W1204 13:19:54.361996736 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5467157Z [rank3]:[W1204 13:19:54.368164488 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5467330Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5467611Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5467775Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5468142Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5468344Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5468448Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5468545Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5468641Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5468643Z 2025-12-04T13:44:25.5468902Z [rank3]:[W1204 13:19:54.370461086 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:25.5469071Z [rank1]:[W1204 13:19:55.029599110 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5469247Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5469498Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5469687Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5470052Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5470251Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5470356Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5470450Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5470548Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5470550Z 2025-12-04T13:44:25.5470783Z [rank1]:[W1204 13:19:55.031247953 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5470951Z [rank2]:[W1204 13:19:55.362162185 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5471128Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5471379Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5471545Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5471910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5472112Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5472215Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5472311Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5472410Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5472412Z 2025-12-04T13:44:25.5472667Z [rank2]:[W1204 13:19:55.363583074 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5472837Z [rank3]:[W1204 13:19:55.370586996 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5473009Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5473269Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5473431Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5473817Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5474018Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5474121Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5474217Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5474311Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5474315Z 2025-12-04T13:44:25.5474548Z [rank3]:[W1204 13:19:55.371761290 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5474716Z [rank1]:[W1204 13:19:56.031415843 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5474890Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5475145Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5475306Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.5475680Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5475879Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5475981Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5476076Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5476174Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5476178Z 2025-12-04T13:44:25.5476410Z [rank1]:[W1204 13:19:56.033188753 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5476598Z [rank2]:[W1204 13:19:56.363754363 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5476772Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5477024Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5477186Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5477611Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.5477815Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5477920Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5478015Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5478113Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5478115Z 2025-12-04T13:44:25.5478349Z [rank2]:[W1204 13:19:56.365540823 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5478521Z [rank3]:[W1204 13:19:56.371915161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5478696Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5478953Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5479113Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5479480Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5479684Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5479787Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5479885Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5479981Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5479982Z 2025-12-04T13:44:25.5480216Z [rank3]:[W1204 13:19:56.373542624 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5480444Z Test results will be stored in test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-c043a2bb54ab4c8d.xml 2025-12-04T13:44:25.5480507Z ============================= test session starts ============================== 2025-12-04T13:44:25.5480619Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T13:44:25.5480661Z cachedir: .pytest_cache 2025-12-04T13:44:25.5480819Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T13:44:25.5480867Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T13:44:25.5480909Z configfile: pytest.ini 2025-12-04T13:44:25.5481070Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T13:44:25.5481173Z collecting ... collected 62 items / 61 deselected / 1 selected 2025-12-04T13:44:25.5481348Z stepcurrent: skipping 25 already run items. 
Running only test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor 2025-12-04T13:44:25.5481396Z Running 1 items in this shard 2025-12-04T13:44:25.5481398Z 2025-12-04T13:44:25.5481652Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor I1204 13:19:57.015000 63410 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 64066 2025-12-04T13:44:25.5481804Z I1204 13:19:57.016000 63410 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 64067 2025-12-04T13:44:25.5481953Z I1204 13:19:57.016000 63410 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 64068 2025-12-04T13:44:25.5482104Z I1204 13:19:57.017000 63410 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 64069 2025-12-04T13:44:25.5482280Z [rank1]:[W1204 13:19:57.033355974 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5482456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5482711Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5482872Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5483236Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5483438Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5483546Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5483641Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5483738Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5483740Z 2025-12-04T13:44:25.5483973Z [rank1]:[W1204 13:19:57.036677580 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5484143Z [rank2]:[W1204 13:19:57.365682905 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5484337Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5484590Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5484753Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5485117Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5485345Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5485451Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5485546Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5485644Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5485646Z 2025-12-04T13:44:25.5485876Z [rank2]:[W1204 13:19:57.366889508 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5486045Z [rank3]:[W1204 13:19:57.373675806 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5486219Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5486478Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5486641Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5487005Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5487208Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5487313Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5487409Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5487543Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5487547Z 2025-12-04T13:44:25.5487781Z [rank3]:[W1204 13:19:57.375334569 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5487950Z [rank1]:[W1204 13:19:58.036877941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5488127Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5488405Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5488567Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5488935Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5489134Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5489262Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5489358Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5489455Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5489457Z 2025-12-04T13:44:25.5489688Z [rank1]:[W1204 13:19:58.039211909 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5489856Z [rank2]:[W1204 13:19:58.367049601 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5490029Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5490285Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5490448Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5490814Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5491016Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5491123Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5491220Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5491342Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5491344Z 2025-12-04T13:44:25.5491637Z [rank2]:[W1204 13:19:58.369235301 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5491816Z [rank3]:[W1204 13:19:58.375456972 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5492021Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5492286Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5492485Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5492890Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5493108Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5493251Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5493357Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5493472Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5493474Z 2025-12-04T13:44:25.5493727Z [rank3]:[W1204 13:19:58.377722241 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5493935Z [rank1]:[W1204 13:19:59.039415401 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5494131Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5494395Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5494578Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5494961Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5495194Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5495321Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5495428Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5495540Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5495543Z 2025-12-04T13:44:25.5495799Z [rank1]:[W1204 13:19:59.041309798 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5495996Z [rank2]:[W1204 13:19:59.369412495 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5496179Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5496456Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5496653Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5497043Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5497271Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5497386Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5497573Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5497681Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5497683Z 2025-12-04T13:44:25.5497949Z [rank2]:[W1204 13:19:59.371257643 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5498151Z [rank3]:[W1204 13:19:59.377876205 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5498336Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5498619Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5498793Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5499182Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5499397Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5499523Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5499648Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5499757Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5499759Z 2025-12-04T13:44:25.5500019Z [rank3]:[W1204 13:19:59.379472079 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5500206Z [rank1]:[W1204 13:20:00.041464143 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5500406Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5500671Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5500858Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5501276Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5501493Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5501628Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5501733Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5501851Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5501878Z 2025-12-04T13:44:25.5502118Z [rank1]:[W1204 13:20:00.043472798 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5502328Z [rank2]:[W1204 13:20:00.371403878 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5502524Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5502791Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5502976Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5503350Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5503592Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5503706Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5503823Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5503942Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5503945Z 2025-12-04T13:44:25.5504182Z [rank2]:[W1204 13:20:00.372876365 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5504398Z [rank3]:[W1204 13:20:00.379592425 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5504582Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5504857Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5505028Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5505438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5505674Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5505790Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5505907Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5506022Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5506024Z 2025-12-04T13:44:25.5506274Z [rank3]:[W1204 13:20:00.381030842 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5506481Z [rank1]:[W1204 13:20:01.043578244 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5506682Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5506968Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5507140Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5507572Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5507794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5507927Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5508038Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5508156Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5508158Z 2025-12-04T13:44:25.5508405Z [rank1]:[W1204 13:20:01.044724018 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5508597Z [rank2]:[W1204 13:20:01.373495710 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5508800Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5509069Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5509256Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5509630Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5509885Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5510024Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5510135Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5510255Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5510257Z 2025-12-04T13:44:25.5510499Z [rank2]:[W1204 13:20:01.374789832 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5510724Z [rank3]:[W1204 13:20:01.381150799 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5510914Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5511191Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5511373Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5511746Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5511986Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5512108Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5512228Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5512333Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5512348Z 2025-12-04T13:44:25.5512593Z [rank3]:[W1204 13:20:01.382382311 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5512797Z [rank1]:[W1204 13:20:02.044827736 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5512988Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5513264Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5513436Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5513827Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5514080Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5514202Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5514320Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5514426Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5514428Z 2025-12-04T13:44:25.5514691Z [rank1]:[W1204 13:20:02.046061698 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5514868Z [rank2]:[W1204 13:20:02.374927749 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5515101Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5515384Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5515558Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5515946Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5516155Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5516299Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5516404Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5516528Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5516530Z 2025-12-04T13:44:25.5516784Z [rank2]:[W1204 13:20:02.376363416 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5516957Z [rank3]:[W1204 13:20:02.382498599 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5517168Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5517442Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5517671Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5518049Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5518267Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5518444Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5518551Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5518670Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5518672Z 2025-12-04T13:44:25.5518913Z [rank3]:[W1204 13:20:02.383879038 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5519099Z [rank1]:[W1204 13:20:03.046223676 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5519321Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5519606Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5519792Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5520168Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5520392Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5520520Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5520645Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5520751Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5520754Z 2025-12-04T13:44:25.5521011Z [rank1]:[W1204 13:20:03.047451518 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5521203Z [rank2]:[W1204 13:20:03.376496965 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5521397Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5521682Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5521855Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5522243Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5522462Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5522594Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5522735Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5522841Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5522843Z 2025-12-04T13:44:25.5523101Z [rank2]:[W1204 13:20:03.378622117 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5523280Z [rank3]:[W1204 13:20:03.383984577 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5523480Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5523768Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5523965Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5524351Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5524565Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5524698Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5524808Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5524935Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5524937Z 2025-12-04T13:44:25.5525180Z [rank3]:[W1204 13:20:03.385155801 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5525372Z [rank1]:[W1204 13:20:04.047560458 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5525578Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5525846Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5526040Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5526413Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5526635Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5526768Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5526884Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5527007Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5527029Z 2025-12-04T13:44:25.5527271Z [rank1]:[W1204 13:20:04.049600992 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5527465Z [rank2]:[W1204 13:20:04.378743907 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5527687Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5527983Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5528184Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5528574Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5528805Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5528914Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5529050Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5529159Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5529161Z 2025-12-04T13:44:25.5529418Z [rank2]:[W1204 13:20:04.380339261 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5529614Z [rank3]:[W1204 13:20:04.385297850 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5529792Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5530087Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5530262Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5530649Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5530864Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5530986Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5531117Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5531226Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5531228Z 2025-12-04T13:44:25.5531511Z [rank3]:[W1204 13:20:04.387527490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5531691Z [rank1]:[W1204 13:20:05.049778531 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5531882Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5532154Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5532371Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5532761Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5532973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5533094Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5533208Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5533338Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5533341Z 2025-12-04T13:44:25.5533585Z [rank1]:[W1204 13:20:05.051979882 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5533815Z [rank3]:W1204 13:20:05.429000 64069 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.5534006Z [rank2]:[W1204 13:20:05.380499431 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5534199Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5534488Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5534664Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5535055Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5535265Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 
2025-12-04T13:44:25.5535396Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5535525Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5535633Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5535635Z 2025-12-04T13:44:25.5535907Z [rank2]:[W1204 13:20:05.382888647 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5536089Z [rank3]:[W1204 13:20:05.387675690 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5536293Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5536562Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5536767Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5537156Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5537373Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5537550Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5537659Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5537780Z frame #6: + 0x1268c0 
(0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5537782Z 2025-12-04T13:44:25.5538026Z [rank3]:[W1204 13:20:05.389720534 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5538264Z [rank1]:W1204 13:20:06.061000 64067 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.5538458Z [rank1]:[W1204 13:20:06.052120033 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5538646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5538920Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5539095Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5539488Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5539718Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5539838Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5539958Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5540069Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5540095Z 
2025-12-04T13:44:25.5540353Z [rank1]:[W1204 13:20:06.054063389 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5540526Z [rank2]:[W1204 13:20:06.383040518 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5540737Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5541002Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5541222Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5541610Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5541817Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5541968Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5542073Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5542197Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5542198Z 2025-12-04T13:44:25.5542441Z [rank2]:[W1204 13:20:06.385280568 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 
2025-12-04T13:44:25.5542629Z [rank3]:[W1204 13:20:06.389913244 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5542847Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5543112Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5543296Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5543671Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5543887Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5544035Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5544140Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5544260Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5544262Z 2025-12-04T13:44:25.5544523Z [rank3]:[W1204 13:20:06.392075456 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5544714Z [rank1]:[W1204 13:20:07.054227271 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5544912Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5545195Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5545402Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5545783Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5550524Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5550649Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5550775Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5550880Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5550884Z 2025-12-04T13:44:25.5551144Z [rank1]:[W1204 13:20:07.056124368 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5551332Z [rank2]:[W1204 13:20:07.385429180 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5551530Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5551811Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5551984Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5552382Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5552593Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5552724Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5552847Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5552953Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5552957Z 2025-12-04T13:44:25.5553217Z [rank2]:[W1204 13:20:07.387618731 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5553424Z [rank3]:[W1204 13:20:07.392196959 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5553626Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5553898Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5554091Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5554505Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5554717Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5554849Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5554963Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5555082Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5555084Z 2025-12-04T13:44:25.5555325Z [rank3]:[W1204 13:20:07.394143975 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5555519Z [rank1]:[W1204 13:20:08.056269711 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5555719Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5555994Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5556178Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5556559Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5556781Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5556918Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5557028Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5557146Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5557148Z 2025-12-04T13:44:25.5557393Z [rank1]:[W1204 13:20:08.058339595 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5557652Z [rank2]:[W1204 13:20:08.387720526 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5557830Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5558121Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5558296Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5558687Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5558946Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5559056Z frame #4: + 0xdc253 (0x78d42d848253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5559191Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5559298Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5559300Z 2025-12-04T13:44:25.5559560Z [rank2]:[W1204 13:20:08.390080993 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5559742Z [rank3]:[W1204 13:20:08.394266379 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5559935Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5560228Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5560399Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5560792Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5561005Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5561130Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5561267Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5561374Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.5561376Z 2025-12-04T13:44:25.5561634Z [rank3]:[W1204 13:20:08.395977551 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5561815Z [rank1]:[W1204 13:20:09.058459469 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5562025Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5562297Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5562494Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5562882Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5563113Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5563236Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5563352Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5563481Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5563484Z 2025-12-04T13:44:25.5563727Z [rank1]:[W1204 13:20:09.059943156 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:25.5563917Z [rank2]:[W1204 13:20:09.390200298 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5564102Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5569333Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5569505Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5569871Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5570078Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5570186Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5570281Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5570380Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5570382Z 2025-12-04T13:44:25.5570616Z [rank2]:[W1204 13:20:09.392617704 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5570787Z [rank3]:[W1204 13:20:09.396099566 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5570966Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5571262Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5571427Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5571794Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5571995Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5572140Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5572235Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5572332Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5572334Z 2025-12-04T13:44:25.5572564Z [rank3]:[W1204 13:20:09.398056612 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5572737Z [rank1]:[W1204 13:20:10.060057852 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5572911Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5573171Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5573332Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5573697Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5573898Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5574003Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5574100Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5574195Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5574197Z 2025-12-04T13:44:25.5574430Z [rank1]:[W1204 13:20:10.061688565 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5574598Z [rank2]:[W1204 13:20:10.392755139 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5574772Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5575048Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5575212Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.5575577Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5575777Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5575905Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5576000Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5576100Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5576102Z 2025-12-04T13:44:25.5576333Z [rank2]:[W1204 13:20:10.394879561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5576504Z [rank3]:[W1204 13:20:10.398151128 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5576677Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5576932Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5577096Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5577462Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.5577703Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5577809Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5577905Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5578004Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5578005Z 2025-12-04T13:44:25.5578236Z [rank3]:[W1204 13:20:10.400269881 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5578406Z [rank1]:[W1204 13:20:11.061842981 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5578578Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5578833Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5579021Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5579384Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5579586Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5579690Z frame #4: + 0xdc253 
(0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5579811Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5579907Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5579908Z 2025-12-04T13:44:25.5580144Z [rank1]:[W1204 13:20:11.063080313 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5580313Z [rank2]:[W1204 13:20:11.395047097 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5580487Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5580742Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5580904Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5581270Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5581470Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5581574Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5581669Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5581767Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5581769Z 2025-12-04T13:44:25.5582003Z [rank2]:[W1204 13:20:11.397221269 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5582170Z [rank3]:[W1204 13:20:11.400407257 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5582345Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5582599Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5582764Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5583146Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5583347Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5583451Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5583546Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5583662Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5583663Z 2025-12-04T13:44:25.5583897Z [rank3]:[W1204 13:20:11.402773034 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5584068Z [rank1]:[W1204 13:20:12.063203341 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5584240Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5584494Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5584657Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5585024Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5585224Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5585329Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5585424Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5585517Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5585521Z 2025-12-04T13:44:25.5585754Z [rank1]:[W1204 13:20:12.065561098 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5585924Z [rank2]:[W1204 13:20:12.397368826 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.5586098Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5586353Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5586514Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5586907Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5587107Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5587213Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5587307Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5587404Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5587425Z 2025-12-04T13:44:25.5587692Z [rank2]:[W1204 13:20:12.399800952 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5587863Z [rank3]:[W1204 13:20:12.402874643 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5588036Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5588292Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5588455Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5588822Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5589023Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5589127Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5589221Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5589317Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5589318Z 2025-12-04T13:44:25.5589549Z [rank3]:[W1204 13:20:12.404907337 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5589722Z [rank1]:[W1204 13:20:13.065685677 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5589894Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5590148Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5590310Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5590706Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5590912Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5591016Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5591111Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5591205Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5591207Z 2025-12-04T13:44:25.5591438Z [rank1]:[W1204 13:20:13.067551885 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5591636Z [rank2]:[W1204 13:20:13.399963510 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5591808Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5592062Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5592224Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5592590Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5592794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5592899Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5592993Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5593089Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5593091Z 2025-12-04T13:44:25.5593325Z [rank2]:[W1204 13:20:13.402633040 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5593494Z [rank3]:[W1204 13:20:13.405039456 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5593669Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5593923Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5594085Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5594451Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5594673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5594778Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5594872Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5594968Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5594970Z 2025-12-04T13:44:25.5595200Z [rank3]:[W1204 13:20:13.407031912 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5595389Z [rank1]:[W1204 13:20:14.067703644 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5595563Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5595818Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5595981Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5596349Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5596553Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5596657Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5596752Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5596848Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5596849Z 2025-12-04T13:44:25.5597087Z [rank1]:[W1204 13:20:14.070060541 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5597257Z [rank2]:[W1204 13:20:14.402769930 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5597431Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5597731Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5597891Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5598257Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5598459Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5598586Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5598683Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5598779Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5598783Z 2025-12-04T13:44:25.5599017Z [rank2]:[W1204 13:20:14.405183146 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5599188Z [rank3]:[W1204 13:20:14.407162021 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5599385Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5599640Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5599802Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5600166Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5600366Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5600473Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5600569Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5600665Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5600667Z 2025-12-04T13:44:25.5600899Z [rank3]:[W1204 13:20:14.409255584 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5601070Z [rank1]:[W1204 13:20:15.070206352 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5601246Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5601506Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5601670Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5602034Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5602235Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5602341Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5602464Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5602560Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5602562Z 2025-12-04T13:44:25.5602793Z [rank1]:[W1204 13:20:15.071837715 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5602962Z [rank2]:[W1204 13:20:15.405301987 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5603134Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5603412Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5603576Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5603942Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5604142Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5604248Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5604345Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5604443Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5604445Z 2025-12-04T13:44:25.5604676Z [rank2]:[W1204 13:20:15.407651424 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5604843Z [rank3]:[W1204 13:20:15.409463584 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5605016Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5605271Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5605434Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5605801Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5606000Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5606104Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5606199Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5606316Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5606318Z 2025-12-04T13:44:25.5606550Z [rank3]:[W1204 13:20:15.410845253 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5606720Z [rank1]:[W1204 13:20:16.071977986 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5606891Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5607146Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5607332Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5607785Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5607987Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5608090Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5608188Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5608283Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5608285Z 2025-12-04T13:44:25.5608519Z [rank1]:[W1204 13:20:16.074017961 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5608687Z [rank2]:[W1204 13:20:16.407775336 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5608858Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5609111Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5609274Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5609643Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5609842Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5609946Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5610043Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5610140Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5610142Z 2025-12-04T13:44:25.5610400Z [rank2]:[W1204 13:20:16.410159933 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5610569Z [rank3]:[W1204 13:20:16.410951425 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5610741Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5610996Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5611184Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5611549Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5611748Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5611852Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5611947Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5612044Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5612046Z 2025-12-04T13:44:25.5612282Z [rank3]:[W1204 13:20:16.412696326 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5612451Z [rank1]:[W1204 13:20:17.074185523 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5612624Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5612876Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5613036Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5613402Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5613602Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5613705Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5613802Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5613897Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5613900Z 2025-12-04T13:44:25.5614152Z [rank1]:[W1204 13:20:17.075506043 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5614323Z [rank2]:[W1204 13:20:17.410453783 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5614496Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5614748Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5614909Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5615300Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5615501Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5615604Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5615700Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5615795Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5615797Z 2025-12-04T13:44:25.5616030Z [rank2]:[W1204 13:20:17.412948017 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5616200Z [rank3]:[W1204 13:20:17.412818710 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5616374Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5616630Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5616793Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5617161Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5617359Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5617463Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5617592Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5617689Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5617690Z 2025-12-04T13:44:25.5617921Z [rank3]:[W1204 13:20:17.414292087 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5618117Z [rank1]:[W1204 13:20:18.075689096 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5618289Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5618542Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5618704Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5619069Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5619293Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5619396Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5619491Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5619586Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5619590Z 2025-12-04T13:44:25.5619820Z [rank1]:[W1204 13:20:18.078859405 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5619993Z [rank2]:[W1204 13:20:18.413086360 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5620165Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5620419Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5620580Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5620949Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5621152Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5621256Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5621351Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5621446Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5621448Z 2025-12-04T13:44:25.5621680Z [rank2]:[W1204 13:20:18.415085656 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5621849Z [rank3]:[W1204 13:20:18.414423320 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5622041Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5622296Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5622458Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5622822Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5623044Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5623148Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5623242Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5623338Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5623339Z 2025-12-04T13:44:25.5623570Z [rank3]:[W1204 13:20:18.416356087 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5623739Z [rank1]:[W1204 13:20:19.079047988 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5623915Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5624168Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5624329Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5624691Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5624893Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5624997Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5625096Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5625191Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5625194Z 2025-12-04T13:44:25.5625425Z [rank1]:[W1204 13:20:19.081392296 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5625593Z [rank2]:[W1204 13:20:19.415259690 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5625767Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5626039Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5626200Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5626566Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5626785Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5626890Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5626985Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5627080Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5627082Z 2025-12-04T13:44:25.5627315Z [rank2]:[W1204 13:20:19.417424151 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5627506Z [rank3]:[W1204 13:20:19.416492592 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5627684Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5627939Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5628100Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5628467Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5628666Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5628772Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5628867Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5628963Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5628965Z 2025-12-04T13:44:25.5629196Z [rank3]:[W1204 13:20:19.418254423 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5629366Z [rank1]:[W1204 13:20:20.081562070 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5629541Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5629822Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5629984Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5630347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5630548Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5630683Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5630780Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5630876Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5630879Z 2025-12-04T13:44:25.5631112Z [rank1]:[W1204 13:20:20.082826362 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5631286Z [rank2]:[W1204 13:20:20.417575457 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5631462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5631718Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5631880Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5632244Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5632446Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5632551Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5632647Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5632743Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5632745Z 2025-12-04T13:44:25.5632978Z [rank2]:[W1204 13:20:20.418804629 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5633146Z [rank3]:[W1204 13:20:20.418382099 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5633320Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5633576Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5633758Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5634124Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5634323Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5634429Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5634542Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5634640Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5634642Z 2025-12-04T13:44:25.5634872Z [rank3]:[W1204 13:20:20.419628861 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5635042Z [rank1]:[W1204 13:20:21.082950689 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5635215Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5635471Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5635638Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5636002Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5636203Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5636306Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5636403Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5636499Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5636501Z 2025-12-04T13:44:25.5636733Z [rank1]:[W1204 13:20:21.084198211 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5636902Z [rank2]:[W1204 13:20:21.418949706 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5637074Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5637328Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5637529Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5637924Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5638127Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5638231Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5638328Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5638449Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5638451Z 2025-12-04T13:44:25.5638685Z [rank2]:[W1204 13:20:21.420296006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5638854Z [rank3]:[W1204 13:20:21.419763778 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5639027Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5639279Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5639444Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5639810Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5640010Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5640115Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5640209Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5640307Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5640309Z 2025-12-04T13:44:25.5640541Z [rank3]:[W1204 13:20:21.421512239 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5640711Z [rank1]:[W1204 13:20:22.084349538 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5640885Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5641136Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5641296Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5641680Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5641883Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5641987Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5642082Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5642179Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5642199Z 2025-12-04T13:44:25.5642431Z [rank1]:[W1204 13:20:22.085613930 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5642600Z [rank2]:[W1204 13:20:22.420448783 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5642772Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5643025Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5643185Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5643557Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5643758Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5643861Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5643957Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5644053Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5644055Z 2025-12-04T13:44:25.5644291Z [rank2]:[W1204 13:20:22.422788261 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5644460Z [rank3]:[W1204 13:20:22.421636267 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5644635Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5644891Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5645054Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5645445Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5645647Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5645753Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5645847Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5645945Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5645947Z 2025-12-04T13:44:25.5646179Z [rank3]:[W1204 13:20:22.423286170 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5646373Z [rank1]:[W1204 13:20:23.085753199 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5646548Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5646799Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5646965Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5647328Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5647576Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5647680Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5647775Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5647871Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5647873Z 2025-12-04T13:44:25.5648103Z [rank1]:[W1204 13:20:23.087238355 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5648275Z [rank2]:[W1204 13:20:23.422893481 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5648450Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5648707Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5648868Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5649234Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5649466Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5649571Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5649667Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5649764Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5649766Z 2025-12-04T13:44:25.5649997Z [rank2]:[W1204 13:20:23.424068684 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5650191Z [rank3]:[W1204 13:20:23.423385080 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5650369Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5650626Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5650787Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5651151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5651352Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5651458Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5651552Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5651647Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5651649Z 2025-12-04T13:44:25.5651881Z [rank3]:[W1204 13:20:23.425307297 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5652053Z [rank1]:[W1204 13:20:24.087440803 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5652229Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5652482Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5652643Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5653009Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5653212Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5653335Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5653430Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5653527Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5653529Z 2025-12-04T13:44:25.5653760Z [rank1]:[W1204 13:20:24.089150015 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5653933Z [rank2]:[W1204 13:20:24.424210814 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5654123Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5654379Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5654540Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5654907Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5655110Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5655214Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5655311Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5655406Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5655408Z 2025-12-04T13:44:25.5655640Z [rank2]:[W1204 13:20:24.426602201 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5655808Z [rank3]:[W1204 13:20:24.425427487 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5655983Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5656238Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5656399Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5656762Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5656962Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5657068Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5657182Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5657280Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5657282Z 2025-12-04T13:44:25.5657548Z [rank3]:[W1204 13:20:24.426753957 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5657717Z [rank1]:[W1204 13:20:25.089307625 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5657892Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5658170Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5658332Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5658697Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5658899Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5659005Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5659101Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5659198Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5659200Z 2025-12-04T13:44:25.5659431Z [rank1]:[W1204 13:20:25.090551008 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5659600Z [rank2]:[W1204 13:20:25.426747682 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5659773Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5660027Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5660190Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5660555Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5660756Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5660861Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5660957Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5661085Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5661087Z 2025-12-04T13:44:25.5661319Z [rank2]:[W1204 13:20:25.427984464 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5661488Z [rank3]:[W1204 13:20:25.426904838 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5661663Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5661917Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5662098Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5662465Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5662665Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5662768Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5662864Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5662960Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5662962Z 2025-12-04T13:44:25.5663195Z [rank3]:[W1204 13:20:25.429682316 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5663364Z [rank1]:[W1204 13:20:26.090701839 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5663537Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5663790Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5663956Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5664322Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5664523Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5664629Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5664725Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5664824Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5664826Z 2025-12-04T13:44:25.5665079Z [rank1]:[W1204 13:20:26.091953121 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5665249Z [rank2]:[W1204 13:20:26.428085847 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5665421Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5665679Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5665864Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5666233Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5666436Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5666537Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5666633Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5666731Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5666733Z 2025-12-04T13:44:25.5666968Z [rank2]:[W1204 13:20:26.430172430 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5667136Z [rank3]:[W1204 13:20:26.429837368 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5667311Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5667586Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5667749Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5668119Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5668318Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5668421Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5668516Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5668611Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5668615Z 2025-12-04T13:44:25.5668872Z [rank3]:[W1204 13:20:26.432093717 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5669040Z [rank1]:[W1204 13:20:27.092116443 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5669213Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5669465Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5669628Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5670018Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5670220Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5670324Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5670419Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5670516Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5670519Z 2025-12-04T13:44:25.5670749Z [rank1]:[W1204 13:20:27.094207996 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5670919Z [rank2]:[W1204 13:20:27.430309283 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5671092Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5671344Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5671505Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5671872Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5672073Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5672175Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5672273Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5672368Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5672370Z 2025-12-04T13:44:25.5672603Z [rank2]:[W1204 13:20:27.431549545 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5672790Z [rank3]:[W1204 13:20:27.432230790 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5672966Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5673220Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5673380Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5673745Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5673968Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5674072Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5674165Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5674261Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5674262Z 2025-12-04T13:44:25.5674496Z [rank3]:[W1204 13:20:27.433393204 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5674668Z [rank1]:[W1204 13:20:28.094437008 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5674841Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5675092Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5675254Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5675615Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5675818Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5675924Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5676018Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5676115Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5676117Z 2025-12-04T13:44:25.5676348Z [rank1]:[W1204 13:20:28.096373874 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5676521Z [rank2]:[W1204 13:20:28.431689069 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5676712Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5676969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5677132Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5677533Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5677768Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5677870Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5677966Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5678062Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5678063Z 2025-12-04T13:44:25.5678295Z [rank2]:[W1204 13:20:28.432922902 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5678465Z [rank3]:[W1204 13:20:28.433515428 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5678639Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5678897Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5679059Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5679423Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5679624Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5679729Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5679822Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5679918Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5679920Z 2025-12-04T13:44:25.5680152Z [rank3]:[W1204 13:20:28.435458915 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5680322Z [rank1]:[W1204 13:20:29.096532959 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5680521Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5680775Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5680938Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5681303Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5681525Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5681630Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5681725Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5681821Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5681822Z 2025-12-04T13:44:25.5682053Z [rank1]:[W1204 13:20:29.097804670 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5682223Z [rank2]:[W1204 13:20:29.433068806 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5682397Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5682652Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5682814Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5683180Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5683383Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5683486Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5683582Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5683677Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5683679Z 2025-12-04T13:44:25.5683912Z [rank2]:[W1204 13:20:29.434274509 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5684081Z [rank3]:[W1204 13:20:29.435789135 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5684255Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5684529Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5684690Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5685055Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5685255Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5685379Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5685476Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5685572Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5685573Z 2025-12-04T13:44:25.5685805Z [rank3]:[W1204 13:20:29.437326001 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5685973Z [rank1]:[W1204 13:20:30.097995155 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5686148Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5686406Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5686568Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5686935Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5687138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5687243Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5687339Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5687436Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5687438Z 2025-12-04T13:44:25.5687703Z [rank1]:[W1204 13:20:30.099747585 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5687871Z [rank2]:[W1204 13:20:30.434408396 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5688044Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5688300Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5688520Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5688884Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5689087Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5689226Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5689322Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5689420Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5689422Z 2025-12-04T13:44:25.5689654Z [rank2]:[W1204 13:20:30.435873793 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5689823Z [rank3]:[W1204 13:20:30.437475837 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5689995Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5690249Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5690412Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5690777Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5690976Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5691080Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5691178Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5691274Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5691277Z 2025-12-04T13:44:25.5691510Z [rank3]:[W1204 13:20:30.438925484 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5691678Z [rank1]:[W1204 13:20:31.099927211 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5691852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5692103Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5692287Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5692654Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5692853Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5692958Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5693055Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5693171Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5693173Z 2025-12-04T13:44:25.5697129Z [rank1]:[W1204 13:20:31.101206952 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5697301Z [rank2]:[W1204 13:20:31.435996010 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5697582Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5697839Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5698007Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5698374Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5698575Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5698679Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5698779Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5698876Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5698878Z 2025-12-04T13:44:25.5699116Z [rank2]:[W1204 13:20:31.437836039 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5699286Z [rank3]:[W1204 13:20:31.439060821 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5699459Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5699715Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5699878Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5700272Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5700474Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5700576Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5700672Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5700769Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5700800Z 2025-12-04T13:44:25.5701035Z [rank3]:[W1204 13:20:31.441236113 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5701203Z [rank1]:[W1204 13:20:32.101368560 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5701377Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5701630Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5701797Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5702173Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5702375Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5702480Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5702576Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5702673Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5702676Z 2025-12-04T13:44:25.5702908Z [rank1]:[W1204 13:20:32.103440443 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5703080Z [rank2]:[W1204 13:20:32.437977737 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5703252Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5703506Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5703669Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5704057Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5704260Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5704364Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5704460Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5704556Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5704559Z 2025-12-04T13:44:25.5704789Z [rank2]:[W1204 13:20:32.439221259 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5704978Z [rank3]:[W1204 13:20:32.441363551 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5705151Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5705406Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5705567Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5705935Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5706135Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5706240Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5706336Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5706432Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5706433Z 2025-12-04T13:44:25.5706666Z [rank3]:[W1204 13:20:32.442527035 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5706841Z [rank1]:[W1204 13:20:33.103605541 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5707015Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5707267Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5707432Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5707841Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5708074Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5708179Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5708273Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5708370Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5708372Z 2025-12-04T13:44:25.5708606Z [rank1]:[W1204 13:20:33.104846743 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5708804Z [rank2]:[W1204 13:20:33.439362418 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5708980Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5709233Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5709396Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5709759Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5709963Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5710068Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5710164Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5710259Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5710262Z 2025-12-04T13:44:25.5710492Z [rank2]:[W1204 13:20:33.440852075 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5710663Z [rank3]:[W1204 13:20:33.442657734 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5710839Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5711096Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5711256Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5711620Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5711840Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5711943Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5712040Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5712134Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5712136Z 2025-12-04T13:44:25.5712368Z [rank3]:[W1204 13:20:33.444338517 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5712536Z [rank1]:[W1204 13:20:34.105018512 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5712729Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5712985Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5713149Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5713515Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5713716Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5713822Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5713916Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5714012Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5714014Z 2025-12-04T13:44:25.5714244Z [rank1]:[W1204 13:20:34.106901990 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5714413Z [rank2]:[W1204 13:20:34.440966695 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5714589Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5714842Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5715005Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5715372Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5715574Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5715699Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5715797Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5715893Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5715896Z 2025-12-04T13:44:25.5716125Z [rank2]:[W1204 13:20:34.443097877 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5716296Z [rank3]:[W1204 13:20:34.444440267 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5716511Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5716770Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5716931Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5717297Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5717518Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5717624Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5717722Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5717817Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5717818Z 2025-12-04T13:44:25.5718050Z [rank3]:[W1204 13:20:34.446526561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5718218Z [rank1]:[W1204 13:20:35.107085940 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5718392Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5718647Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5718808Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5719175Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5719374Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5719481Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5719575Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5719697Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5719699Z 2025-12-04T13:44:25.5719930Z [rank1]:[W1204 13:20:35.108945678 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5720099Z [rank2]:[W1204 13:20:35.443206509 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5720274Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5720550Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5720716Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5721080Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5721282Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5721384Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5721483Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5721581Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5721582Z 2025-12-04T13:44:25.5721815Z [rank2]:[W1204 13:20:35.445676894 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5721985Z [rank3]:[W1204 13:20:35.446631502 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5722157Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5722410Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5722574Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5722939Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5723140Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5723244Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5723341Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5723435Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5723461Z 2025-12-04T13:44:25.5723694Z [rank3]:[W1204 13:20:35.448638258 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5723864Z [rank1]:[W1204 13:20:36.109114929 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5724037Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5724288Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5724472Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5724838Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5725036Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5725143Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5725238Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5725336Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5725337Z 2025-12-04T13:44:25.5725571Z [rank1]:[W1204 13:20:36.110811971 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5725743Z [rank2]:[W1204 13:20:36.445847875 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5725918Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5726172Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5726337Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5726704Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5726904Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5727008Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5727105Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5727204Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5727206Z 2025-12-04T13:44:25.5727456Z [rank2]:[W1204 13:20:36.447630035 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5727658Z [rank3]:[W1204 13:20:36.448804579 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5727830Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5728085Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5730600Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5730975Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5731178Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5731283Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5731380Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5731475Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5731516Z 2025-12-04T13:44:25.5731751Z [rank3]:[W1204 13:20:36.450026851 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5731921Z [rank1]:[W1204 13:20:37.111014582 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5732097Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5732351Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5732514Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5732884Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5733086Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5733191Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5733286Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5733383Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5733387Z 2025-12-04T13:44:25.5733621Z [rank1]:[W1204 13:20:37.113687003 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5733823Z [rank2]:[W1204 13:20:37.447782718 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5733997Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5734252Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5734414Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5734852Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5735056Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5735160Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5735258Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5735354Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5735356Z 2025-12-04T13:44:25.5735589Z [rank2]:[W1204 13:20:37.449635826 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5735761Z [rank3]:[W1204 13:20:37.450174554 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5735935Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5736188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5736353Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5736720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5736921Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5737026Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5737122Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5737220Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5737222Z 2025-12-04T13:44:25.5737454Z [rank3]:[W1204 13:20:37.452284467 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5737703Z [rank1]:[W1204 13:20:38.113861115 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5737878Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5738131Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5738294Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5738659Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5738905Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5739009Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5739105Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5739202Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5739203Z 2025-12-04T13:44:25.5739438Z [rank1]:[W1204 13:20:38.116246592 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5739612Z [rank2]:[W1204 13:20:38.449778530 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5739787Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5740042Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5740203Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5740567Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5740771Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5740875Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5740972Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5741067Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5741069Z 2025-12-04T13:44:25.5741301Z [rank2]:[W1204 13:20:38.451967481 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5741472Z [rank3]:[W1204 13:20:38.452363452 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5741668Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5741922Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5742085Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5742451Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5742676Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5742781Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5742875Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5742971Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5742973Z 2025-12-04T13:44:25.5743204Z [rank3]:[W1204 13:20:38.454360768 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5743374Z [rank1]:[W1204 13:20:39.116433676 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5743550Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5743807Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5743969Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5744333Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5744535Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5744639Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5744735Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5744832Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5744834Z 2025-12-04T13:44:25.5745064Z [rank1]:[W1204 13:20:39.118353583 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5745233Z [rank2]:[W1204 13:20:39.452123706 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5745407Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5745680Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5745842Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5746213Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5746417Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5746543Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5746640Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5746736Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5746738Z 2025-12-04T13:44:25.5746971Z [rank2]:[W1204 13:20:39.454187079 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5747140Z [rank3]:[W1204 13:20:39.454566531 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5747316Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5747605Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5747766Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5748131Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5748332Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5748439Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5748535Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5748631Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5748633Z 2025-12-04T13:44:25.5748864Z [rank3]:[W1204 13:20:39.456171045 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5749073Z [rank2]:W1204 13:20:39.592000 64068 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.5749275Z [rank0]:W1204 13:20:39.918000 64066 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.5749472Z [rank1]:[W1204 13:20:40.118478648 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5749647Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5749902Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5750066Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5750434Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5750662Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5750767Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5750862Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5750958Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5750960Z 2025-12-04T13:44:25.5751191Z [rank1]:[W1204 13:20:40.120746128 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5751364Z [rank2]:[W1204 13:20:40.454419833 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5751539Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5751794Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5751957Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5752320Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5752527Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5752631Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5752730Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5752826Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5752827Z 2025-12-04T13:44:25.5753062Z [rank2]:[W1204 13:20:40.456402089 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5753234Z [rank3]:[W1204 13:20:40.456299311 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5753434Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5753690Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5753851Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5754216Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5754441Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5754546Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5754642Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5754737Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5754738Z 2025-12-04T13:44:25.5754971Z [rank3]:[W1204 13:20:40.458263427 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5755141Z [rank1]:[W1204 13:20:41.120879914 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5755320Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5755574Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5755735Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5756101Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5756302Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5756408Z frame #4: + 0xdc253 (0x7e1038e48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5756503Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5756598Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5756600Z 2025-12-04T13:44:25.5756831Z [rank1]:[W1204 13:20:41.122213535 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5757001Z [rank2]:[W1204 13:20:41.456551836 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5757177Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5757453Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5757669Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5758037Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5758239Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5758370Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5758466Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5758564Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.5758565Z 2025-12-04T13:44:25.5758796Z [rank2]:[W1204 13:20:41.458325986 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5758969Z [rank3]:[W1204 13:20:41.458409084 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5759142Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5759400Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5759562Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5759933Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5760136Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5760242Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5760338Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5760433Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5760434Z 2025-12-04T13:44:25.5760667Z [rank3]:[W1204 13:20:41.460345961 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:25.5760836Z [rank1]:[W1204 13:20:42.122489939 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5761009Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5761290Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5761451Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5761818Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5762020Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5762147Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5762242Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5762340Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5762342Z 2025-12-04T13:44:25.5762576Z [rank1]:[W1204 13:20:42.124714349 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5762745Z [rank2]:[W1204 13:20:42.458441744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5762920Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5763175Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5763340Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5763705Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5763908Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5764014Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5764112Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5764209Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5764211Z 2025-12-04T13:44:25.5764441Z [rank2]:[W1204 13:20:42.460358561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5764611Z [rank3]:[W1204 13:20:42.460445319 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5764783Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5765039Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5765221Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5765587Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5765789Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5765891Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5766013Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5766108Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5766109Z 2025-12-04T13:44:25.5766345Z [rank3]:[W1204 13:20:42.461887127 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5766514Z [rank1]:[W1204 13:20:43.124895137 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5766691Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5766945Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5767109Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.5767517Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5767717Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5767822Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5767916Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5768014Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5768015Z 2025-12-04T13:44:25.5768250Z [rank1]:[W1204 13:20:43.126735506 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5768419Z [rank2]:[W1204 13:20:43.460503380 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5768592Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5768846Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5769013Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5769406Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.5769609Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5769713Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5769807Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5769932Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5769933Z 2025-12-04T13:44:25.5770166Z [rank2]:[W1204 13:20:43.462349299 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5770335Z [rank3]:[W1204 13:20:43.461994947 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5770508Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5770764Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5770926Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5771292Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5771493Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5771595Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5771690Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5771784Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5771788Z 2025-12-04T13:44:25.5772022Z [rank3]:[W1204 13:20:43.463230259 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5772191Z [rank1]:[W1204 13:20:44.126917324 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5772365Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5772621Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5772782Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5773168Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5773368Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5773472Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5773567Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5773662Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5773677Z 2025-12-04T13:44:25.5773923Z [rank1]:[W1204 13:20:44.128156926 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5774093Z [rank2]:[W1204 13:20:44.462513348 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5774266Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5774519Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5774685Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5775059Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5775261Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5775364Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5775458Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5775556Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5775558Z 2025-12-04T13:44:25.5775728Z [rank3]:[W1204 13:20:44.463353669 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.5775904Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5776157Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5776319Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5776683Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5776884Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5777010Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5777104Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5777201Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5777203Z 2025-12-04T13:44:25.5777434Z [rank2]:[W1204 13:20:44.464788157 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5777701Z [rank3]:[W1204 13:20:44.464788187 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5777898Z [rank1]:[W1204 13:20:45.128314857 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.5778071Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5778326Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5778488Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5778853Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5779055Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5779161Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5779256Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5779353Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5779355Z 2025-12-04T13:44:25.5779587Z [rank1]:[W1204 13:20:45.130168635 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5779757Z [rank2]:[W1204 13:20:45.465017216 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5779931Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5780185Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5780347Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5780711Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5780939Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5781044Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5781140Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5781237Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5781239Z 2025-12-04T13:44:25.5781410Z [rank3]:[W1204 13:20:45.465017226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5781582Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5781863Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5782026Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5782391Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5782591Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5782697Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5782793Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5782889Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5782891Z 2025-12-04T13:44:25.5783122Z [rank2]:[W1204 13:20:45.466914633 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5783350Z [rank3]:[W1204 13:20:45.466916763 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5783523Z [rank1]:[W1204 13:20:46.130361445 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5783698Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5783955Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5784115Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5784480Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5784681Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5784804Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5784901Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5784996Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5784998Z 2025-12-04T13:44:25.5785230Z [rank1]:[W1204 13:20:46.132679123 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5785399Z [rank3]:[W1204 13:20:46.467034816 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5785594Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5785852Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5786017Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5786380Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5786581Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5786688Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5786783Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5786881Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5786883Z 2025-12-04T13:44:25.5787114Z [rank3]:[W1204 13:20:46.468692609 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5787283Z [rank2]:[W1204 13:20:46.467030726 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5787455Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5787760Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5787924Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5788290Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5788489Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5788595Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5788722Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5788818Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5788820Z 2025-12-04T13:44:25.5789051Z [rank2]:[W1204 13:20:46.468931413 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5789222Z [rank1]:[W1204 13:20:47.132839045 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5789394Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5789675Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5789835Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5790202Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5790401Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5790508Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5790603Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5790699Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5790701Z 2025-12-04T13:44:25.5790934Z [rank1]:[W1204 13:20:47.134684394 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5791102Z [rank3]:[W1204 13:20:47.468829191 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5791279Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5791533Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5791699Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5792064Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5792264Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5792369Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5792468Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5792584Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5792586Z 2025-12-04T13:44:25.5792816Z [rank3]:[W1204 13:20:47.470067234 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5792986Z [rank2]:[W1204 13:20:47.469816489 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5793158Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5793411Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5793595Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5793960Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5794161Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5794265Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5794363Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5794459Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5794461Z 2025-12-04T13:44:25.5794695Z [rank2]:[W1204 13:20:47.472211346 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5794867Z [rank1]:[W1204 13:20:48.134839197 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5795040Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5795295Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5795457Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5795823Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5796022Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5796126Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5796221Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5796319Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5796320Z 2025-12-04T13:44:25.5796573Z [rank1]:[W1204 13:20:48.136829013 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5796743Z [rank3]:[W1204 13:20:48.470220857 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5796918Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5797171Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5797364Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5797774Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5797974Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5798079Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5798173Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5798270Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5798273Z 2025-12-04T13:44:25.5798506Z [rank3]:[W1204 13:20:48.471668845 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5798676Z [rank2]:[W1204 13:20:48.472327120 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5798849Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5799104Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5799267Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5799635Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5799838Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5799941Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5800037Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5800133Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5800136Z 2025-12-04T13:44:25.5800394Z [rank2]:[W1204 13:20:48.474232147 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5800565Z [rank1]:[W1204 13:20:49.136969387 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5800738Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5800994Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5801156Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5801547Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5801746Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5801851Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5801946Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5802042Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5802044Z 2025-12-04T13:44:25.5802278Z [rank1]:[W1204 13:20:49.138534822 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5802448Z [rank3]:[W1204 13:20:49.471970356 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5802622Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5802879Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5803040Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5803411Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5803611Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5803716Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5803810Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5803907Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5803909Z 2025-12-04T13:44:25.5804140Z [rank3]:[W1204 13:20:49.473330715 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5804333Z [rank2]:[W1204 13:20:49.474326463 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5804506Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5804763Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5804926Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5805294Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5805517Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5805620Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5805716Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5805812Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5805813Z 2025-12-04T13:44:25.5806046Z [rank2]:[W1204 13:20:49.476043745 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5806220Z [rank1]:[W1204 13:20:50.138658918 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5806395Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5806649Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5806810Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5807178Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5807382Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5807529Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5807626Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5807722Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5807724Z 2025-12-04T13:44:25.5807956Z [rank1]:[W1204 13:20:50.140973546 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5808126Z [rank3]:[W1204 13:20:50.473455921 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5808326Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5808580Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5808742Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5809109Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5809339Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5809447Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5809543Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5809641Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5809643Z 2025-12-04T13:44:25.5809873Z [rank3]:[W1204 13:20:50.474840230 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5810043Z [rank2]:[W1204 13:20:50.476138211 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5810220Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5810472Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5810635Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5811000Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5811203Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5811307Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5811403Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5811499Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5811502Z 2025-12-04T13:44:25.5811734Z [rank2]:[W1204 13:20:50.477837243 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5811905Z [rank1]:[W1204 13:20:51.141093483 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5812078Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5812360Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5812522Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5812888Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5813107Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5813212Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5813308Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5813403Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5813405Z 2025-12-04T13:44:25.5813638Z [rank1]:[W1204 13:20:51.142426683 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5813808Z [rank3]:[W1204 13:20:51.475010396 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5813984Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5814240Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5814401Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5814765Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5814967Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5815073Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5815168Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5815265Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5815267Z 2025-12-04T13:44:25.5815499Z [rank3]:[W1204 13:20:51.476608190 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5815668Z [rank2]:[W1204 13:20:51.477945361 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5815842Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5816118Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5816280Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5816644Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5816844Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5816966Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5817064Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5817160Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5817164Z 2025-12-04T13:44:25.5817395Z [rank2]:[W1204 13:20:51.479454177 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5817600Z [rank1]:[W1204 13:20:52.142599829 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5817773Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5818032Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5818193Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5818558Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5818758Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5818864Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5818959Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5819055Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5819057Z 2025-12-04T13:44:25.5824077Z [rank1]:[W1204 13:20:52.144409339 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5824262Z [rank3]:[W1204 13:20:52.476752008 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5824440Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5824698Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5824909Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5825279Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5825480Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5825586Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5825710Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5825809Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5825812Z 2025-12-04T13:44:25.5826045Z [rank3]:[W1204 13:20:52.477983590 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5826217Z [rank2]:[W1204 13:20:52.479573115 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5826392Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5826646Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5826814Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5827180Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5827383Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5827528Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5827628Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5827724Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5827727Z 2025-12-04T13:44:25.5827962Z [rank2]:[W1204 13:20:52.481738406 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5828132Z [rank1]:[W1204 13:20:53.144605576 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5828305Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5828560Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5828722Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5829117Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5829319Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5829423Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5829520Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5829648Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5829650Z 2025-12-04T13:44:25.5829884Z [rank1]:[W1204 13:20:53.145966665 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5830053Z [rank3]:[W1204 13:20:53.478083820 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5830228Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5830482Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5830648Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5831015Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5831218Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5831323Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5831417Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5831513Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5831517Z 2025-12-04T13:44:25.5831750Z [rank3]:[W1204 13:20:53.479288653 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5831921Z [rank2]:[W1204 13:20:53.481820726 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5832096Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5832349Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5832511Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5832898Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5833100Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5833204Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5833302Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5833398Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5833421Z 2025-12-04T13:44:25.5833653Z [rank2]:[W1204 13:20:53.483200115 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5833824Z [rank1]:[W1204 13:20:54.146128964 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5833997Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5834251Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5834412Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5834780Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5834980Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5835083Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5835178Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5835273Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5835275Z 2025-12-04T13:44:25.5835512Z [rank1]:[W1204 13:20:54.148220377 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5835681Z [rank3]:[W1204 13:20:54.479443922 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5835856Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5836108Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5836270Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5836657Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5836858Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5836962Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5837055Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5837151Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5837153Z 2025-12-04T13:44:25.5837384Z [rank3]:[W1204 13:20:54.481681312 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5837613Z [rank2]:[W1204 13:20:54.483300186 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5837788Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5838041Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5838203Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5838568Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5838774Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5838878Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5838973Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5839069Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5839072Z 2025-12-04T13:44:25.5839304Z [rank2]:[W1204 13:20:54.484965918 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5839475Z [rank1]:[W1204 13:20:55.148336578 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5839649Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5839904Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5840064Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5840427Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5840657Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5840761Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5840857Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5840952Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5840954Z 2025-12-04T13:44:25.5841186Z [rank1]:[W1204 13:20:55.150086039 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5841379Z [rank3]:[W1204 13:20:55.481849342 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5841554Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5841808Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5841971Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5842338Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5842541Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5842646Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5842740Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5842835Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5842837Z 2025-12-04T13:44:25.5843067Z [rank3]:[W1204 13:20:55.483644371 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5843236Z [rank2]:[W1204 13:20:55.485044990 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5843412Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5843664Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5843828Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5844193Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5844395Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5844517Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5844615Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5844712Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5844713Z 2025-12-04T13:44:25.5844943Z [rank2]:[W1204 13:20:55.486193205 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5845112Z [rank1]:[W1204 13:20:56.150233039 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5845311Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5845566Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5845728Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5846094Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5846298Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5846403Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5846499Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5846594Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5846596Z 2025-12-04T13:44:25.5846829Z [rank1]:[W1204 13:20:56.151987070 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5846997Z [rank3]:[W1204 13:20:56.483793273 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5847173Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5847427Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5847635Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5848002Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5848201Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5848307Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5848425Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5848523Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5848525Z 2025-12-04T13:44:25.5848758Z [rank3]:[W1204 13:20:56.485034755 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5848927Z [rank2]:[W1204 13:20:56.486270767 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5849099Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5849382Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5849545Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5849908Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5850109Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5850214Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5850311Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5850407Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5850409Z 2025-12-04T13:44:25.5850642Z [rank2]:[W1204 13:20:56.488587765 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5850811Z [rank1]:[W1204 13:20:57.152089903 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5850983Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5851238Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5851400Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5851767Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5851968Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5852073Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5852168Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5852282Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5852284Z 2025-12-04T13:44:25.5852516Z [rank1]:[W1204 13:20:57.153329455 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5852683Z [rank3]:[W1204 13:20:57.485153668 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5852858Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5853115Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5853297Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5853662Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5853862Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5853966Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5854062Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5854158Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5854161Z 2025-12-04T13:44:25.5854392Z [rank3]:[W1204 13:20:57.486690203 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5854560Z [rank2]:[W1204 13:20:57.488664539 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5854733Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5854988Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5855154Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5855522Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5855722Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5855827Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5855922Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5856021Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5856023Z 2025-12-04T13:44:25.5856274Z [rank2]:[W1204 13:20:57.490214744 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5856443Z [rank1]:[W1204 13:20:58.153491068 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5856615Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5856868Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5857051Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5857421Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5857695Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5857799Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5857895Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5857992Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5857994Z 2025-12-04T13:44:25.5858227Z [rank1]:[W1204 13:20:58.154737060 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5858396Z [rank3]:[W1204 13:20:58.486812227 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5858570Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5858825Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5858987Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5859356Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5859557Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5859661Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5859755Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5859850Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5859854Z 2025-12-04T13:44:25.5860116Z [rank3]:[W1204 13:20:58.488406781 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5860286Z [rank2]:[W1204 13:20:58.490287119 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5860460Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5860712Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5860874Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5861264Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5861469Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5861573Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5861668Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5861766Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5861769Z 2025-12-04T13:44:25.5862001Z [rank2]:[W1204 13:20:58.491961242 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5862174Z [rank1]:[W1204 13:20:59.154919833 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5862347Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5862601Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5862762Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5863133Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5863333Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5863436Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5863532Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5863627Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5863628Z 2025-12-04T13:44:25.5863861Z [rank1]:[W1204 13:20:59.156293362 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5864052Z [rank3]:[W1204 13:20:59.488614964 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5864229Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5864489Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5864650Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5865017Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5865241Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5865345Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5865439Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5865536Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5865537Z 2025-12-04T13:44:25.5865769Z [rank3]:[W1204 13:20:59.490412394 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5865941Z [rank2]:[W1204 13:20:59.492058447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5866115Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5866369Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5866533Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5866897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5867101Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5867205Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5867300Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5867397Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5867399Z 2025-12-04T13:44:25.5867680Z [rank2]:[W1204 13:20:59.493545404 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5867852Z [rank1]:[W1204 13:21:00.156467096 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5868053Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5868307Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5868468Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5868837Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5869065Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5869168Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5869262Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5869357Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5869359Z 2025-12-04T13:44:25.5869592Z [rank1]:[W1204 13:21:00.158730726 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5869762Z [rank3]:[W1204 13:21:00.490588838 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5869936Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5870191Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5870352Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5870716Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5870921Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5871025Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5871119Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5871215Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5871217Z 2025-12-04T13:44:25.5871449Z [rank3]:[W1204 13:21:00.491826070 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5871616Z [rank2]:[W1204 13:21:00.493682289 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5871791Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5872065Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5872226Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5872589Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5872810Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5872915Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5873010Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5873107Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5873108Z 2025-12-04T13:44:25.5873339Z [rank2]:[W1204 13:21:00.495461979 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5873507Z [rank1]:[W1204 13:21:01.158903271 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5873682Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5873936Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5874096Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5874459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5874662Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5874766Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5874863Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5874958Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5874960Z 2025-12-04T13:44:25.5875195Z [rank1]:[W1204 13:21:01.160273670 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5875363Z [rank3]:[W1204 13:21:01.492025195 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5875538Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5875816Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5875978Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5876341Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5876540Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5876666Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5876762Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5876858Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5876860Z 2025-12-04T13:44:25.5877092Z [rank3]:[W1204 13:21:01.493885214 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5877261Z [rank2]:[W1204 13:21:01.495715233 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5877435Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5877729Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5877891Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5878254Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5878456Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5878564Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5878659Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5878758Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5878760Z 2025-12-04T13:44:25.5878991Z [rank2]:[W1204 13:21:01.498122139 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5879160Z [rank1]:[W1204 13:21:02.160442006 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5879333Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5879592Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5879780Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5880144Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5880346Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5880462Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5880571Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5880668Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5880670Z 2025-12-04T13:44:25.5880902Z [rank1]:[W1204 13:21:02.161742467 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5881069Z [rank3]:[W1204 13:21:02.494067430 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5881243Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5881498Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5881663Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5882031Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5882230Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5882334Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5882430Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5882526Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5882529Z 2025-12-04T13:44:25.5882761Z [rank3]:[W1204 13:21:02.496301110 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5882930Z [rank2]:[W1204 13:21:02.498215797 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5883105Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5883359Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5883542Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5883914Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5884118Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5884221Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5884316Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5884435Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5884436Z 2025-12-04T13:44:25.5884668Z [rank2]:[W1204 13:21:02.499369451 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5884838Z [rank1]:[W1204 13:21:03.161923934 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5885010Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5885264Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5885428Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5885792Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5885996Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5886101Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5886197Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5886294Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5886296Z 2025-12-04T13:44:25.5886530Z [rank1]:[W1204 13:21:03.163646965 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5886700Z [rank3]:[W1204 13:21:03.496487027 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5886873Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5887127Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5887290Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5887718Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5887918Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5888022Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5888117Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5888214Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5888245Z 2025-12-04T13:44:25.5888481Z [rank3]:[W1204 13:21:03.497809467 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5888650Z [rank2]:[W1204 13:21:03.499468660 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5888822Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5889075Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5889238Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5889605Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5889805Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5889910Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5890005Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5890103Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5890107Z 2025-12-04T13:44:25.5890340Z [rank2]:[W1204 13:21:03.501429356 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5890515Z [rank1]:[W1204 13:21:04.163814803 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5890687Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5890944Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5891108Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5891492Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5891695Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5891800Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5891895Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5891990Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5891991Z 2025-12-04T13:44:25.5892224Z [rank1]:[W1204 13:21:04.165086055 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5892414Z [rank3]:[W1204 13:21:04.497991465 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5892588Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5892843Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5893004Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5893369Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5893570Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5893675Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5893769Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5893866Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5893868Z 2025-12-04T13:44:25.5894102Z [rank3]:[W1204 13:21:04.499252757 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5894273Z [rank2]:[W1204 13:21:04.501542666 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5894448Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5894701Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5894865Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5895230Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5895453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5895558Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5895653Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5895749Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5895751Z 2025-12-04T13:44:25.5895982Z [rank2]:[W1204 13:21:04.503229178 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5896171Z [rank1]:[W1204 13:21:05.165221864 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5896345Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5896600Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5896761Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5897127Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5897332Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5897435Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5897568Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5897663Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5897665Z 2025-12-04T13:44:25.5897898Z [rank1]:[W1204 13:21:05.166591024 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5898069Z [rank3]:[W1204 13:21:05.499388277 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5898244Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5898500Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5898661Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5899025Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5899256Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5899363Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5899459Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5899554Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5899556Z 2025-12-04T13:44:25.5899787Z [rank3]:[W1204 13:21:05.501393572 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5899955Z [rank2]:[W1204 13:21:05.503313859 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5900154Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5900408Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5900571Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5900939Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5901141Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5901246Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5901341Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5901439Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5901440Z 2025-12-04T13:44:25.5901671Z [rank2]:[W1204 13:21:05.505554479 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5901840Z [rank1]:[W1204 13:21:06.166735344 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5902014Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5902270Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5902433Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5902797Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5903000Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5903124Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5903219Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5903314Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5903316Z 2025-12-04T13:44:25.5903550Z [rank1]:[W1204 13:21:06.167996766 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5903721Z [rank3]:[W1204 13:21:06.501606031 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5903911Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5904179Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5904342Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5904708Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5904908Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5905014Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5905111Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5905207Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5905209Z 2025-12-04T13:44:25.5905441Z [rank3]:[W1204 13:21:06.503141317 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5905611Z [rank2]:[W1204 13:21:06.505671330 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5905786Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5906041Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5906205Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5906571Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5906773Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5906878Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5906972Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5907091Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5907093Z 2025-12-04T13:44:25.5907324Z [rank2]:[W1204 13:21:06.508092156 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5907527Z [rank1]:[W1204 13:21:07.168158257 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5907700Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5907984Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5908149Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5908514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5908715Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5908819Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5908916Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5909014Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5909016Z 2025-12-04T13:44:25.5909248Z [rank1]:[W1204 13:21:07.169386899 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5909418Z [rank3]:[W1204 13:21:07.503345147 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5909591Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5909846Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5910013Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5910377Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5910576Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5910681Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5910778Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5910873Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5910899Z 2025-12-04T13:44:25.5911131Z [rank3]:[W1204 13:21:07.505443270 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5911299Z [rank2]:[W1204 13:21:07.508209109 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5911472Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5911725Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5911908Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5912277Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5912478Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5912582Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5912677Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5912776Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5912778Z 2025-12-04T13:44:25.5913010Z [rank2]:[W1204 13:21:07.509935790 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5913180Z [rank1]:[W1204 13:21:08.169516362 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5913354Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5913608Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5913771Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5914137Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5914337Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5914441Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5914536Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5914633Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5914635Z 2025-12-04T13:44:25.5914887Z [rank1]:[W1204 13:21:08.170734655 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5915057Z [rank3]:[W1204 13:21:08.505622992 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5915230Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5915484Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5915658Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5916035Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5916235Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5916338Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5916434Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5916530Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5916533Z 2025-12-04T13:44:25.5916767Z [rank3]:[W1204 13:21:08.506947513 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5916935Z [rank2]:[W1204 13:21:08.510052953 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5917108Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5917360Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5917565Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5917933Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5918133Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5918237Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5918331Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5918428Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5918432Z 2025-12-04T13:44:25.5918666Z [rank2]:[W1204 13:21:08.511712726 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5918872Z [rank1]:[W1204 13:21:09.170881048 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5919045Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5919299Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5919461Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5919851Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5920053Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5920155Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5920251Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5920347Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5920350Z 2025-12-04T13:44:25.5920581Z [rank1]:[W1204 13:21:09.172258017 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5920755Z [rank3]:[W1204 13:21:09.507159965 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5920929Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5921184Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5921345Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5921711Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5921913Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5922017Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5922113Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5922208Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5922210Z 2025-12-04T13:44:25.5922442Z [rank3]:[W1204 13:21:09.509338986 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5922632Z [rank2]:[W1204 13:21:09.511820701 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5922808Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5923063Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5923227Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5923592Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5923815Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5923920Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5924014Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5924110Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5924112Z 2025-12-04T13:44:25.5924342Z [rank2]:[W1204 13:21:09.513782267 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5924514Z [rank1]:[W1204 13:21:10.172421941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5924690Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5924942Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5925104Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5925472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5925675Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5925779Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5925874Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5925968Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5925971Z 2025-12-04T13:44:25.5926202Z [rank1]:[W1204 13:21:10.173720272 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5926372Z [rank3]:[W1204 13:21:10.509514020 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5926567Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5926823Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5926984Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5927350Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5927606Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5927710Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5927805Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5927900Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5927902Z 2025-12-04T13:44:25.5928134Z [rank3]:[W1204 13:21:10.511236171 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5928303Z [rank2]:[W1204 13:21:10.513852703 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5928478Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5928732Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5928895Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5929261Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5929463Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5929569Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5929664Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5929761Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5929763Z 2025-12-04T13:44:25.5929994Z [rank2]:[W1204 13:21:10.516151082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5930162Z [rank1]:[W1204 13:21:11.173839538 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5930338Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5930615Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5930778Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5931141Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5931340Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5931470Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5931567Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5931663Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5931666Z 2025-12-04T13:44:25.5931901Z [rank1]:[W1204 13:21:11.175063310 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5932071Z [rank3]:[W1204 13:21:11.511399976 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5932244Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5932500Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5932662Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5933026Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5933226Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5933331Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5933427Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5933523Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5933525Z 2025-12-04T13:44:25.5933759Z [rank3]:[W1204 13:21:11.513299574 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5933929Z [rank2]:[W1204 13:21:11.516269408 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5934102Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5934384Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5934548Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5934912Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5935112Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5935242Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5935336Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5935435Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5935437Z 2025-12-04T13:44:25.5935668Z [rank2]:[W1204 13:21:11.517826983 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5935836Z [rank1]:[W1204 13:21:12.175196477 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5936011Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5936267Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5936430Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5936793Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5936994Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5937097Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5937194Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5937292Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5937294Z 2025-12-04T13:44:25.5937544Z [rank1]:[W1204 13:21:12.176920678 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5937712Z [rank3]:[W1204 13:21:12.513609476 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5937884Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5938139Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5938331Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5938697Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5938897Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5939000Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5939121Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5939217Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5939219Z 2025-12-04T13:44:25.5939453Z [rank3]:[W1204 13:21:12.515834947 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5939621Z [rank2]:[W1204 13:21:12.517928570 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5939794Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5940047Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5940211Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5940582Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5940781Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5940886Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5940980Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5941078Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5941080Z 2025-12-04T13:44:25.5941312Z [rank2]:[W1204 13:21:12.521121539 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5941482Z [rank1]:[W1204 13:21:13.177101294 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5941656Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5941908Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5942074Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5942459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5942661Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5942765Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5942861Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5942977Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5942979Z 2025-12-04T13:44:25.5943211Z [rank1]:[W1204 13:21:13.178371266 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5943382Z [rank3]:[W1204 13:21:13.515990204 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5943554Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5943807Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5943967Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5944335Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5944535Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5944639Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5944734Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5944829Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5944833Z 2025-12-04T13:44:25.5945068Z [rank3]:[W1204 13:21:13.518293962 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5945236Z [rank2]:[W1204 13:21:13.521198597 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5945411Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5945662Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5945824Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5946213Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5946414Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5946518Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5946612Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5946708Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5946722Z 2025-12-04T13:44:25.5946974Z [rank2]:[W1204 13:21:13.523213692 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5947147Z [rank1]:[W1204 13:21:14.178640861 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5947320Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5947616Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5947777Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5948144Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5948344Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5948448Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5948543Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5948641Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5948643Z 2025-12-04T13:44:25.5948874Z [rank1]:[W1204 13:21:14.179999270 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5949048Z [rank3]:[W1204 13:21:14.518437090 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5949223Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5949479Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5949640Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5950037Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5950244Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5950348Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5950444Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5950540Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5950542Z 2025-12-04T13:44:25.5950774Z [rank3]:[W1204 13:21:14.520705220 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5950967Z [rank2]:[W1204 13:21:14.523316821 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5951142Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5951396Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5951558Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5951924Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5952125Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5952229Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5952325Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5952421Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5952423Z 2025-12-04T13:44:25.5952654Z [rank2]:[W1204 13:21:14.525730647 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5952827Z [rank1]:[W1204 13:21:15.180159049 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5953003Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5953255Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5953418Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5953788Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5954008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5954114Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5954208Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5954305Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5954306Z 2025-12-04T13:44:25.5954537Z [rank1]:[W1204 13:21:15.181400931 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5954718Z [rank3]:[W1204 13:21:15.520910337 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5954902Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5955156Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5955317Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5955686Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5955889Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5955992Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5956088Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5956183Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5956184Z 2025-12-04T13:44:25.5956416Z [rank3]:[W1204 13:21:15.522511561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5956585Z [rank2]:[W1204 13:21:15.525865116 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5956761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5957016Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5957178Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5957719Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5957924Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5958058Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5958154Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5958252Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5958254Z 2025-12-04T13:44:25.5958485Z [rank2]:[W1204 13:21:15.527768114 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5958654Z [rank1]:[W1204 13:21:16.181608589 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5958855Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5959109Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5959271Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5959634Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5959839Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5959945Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5960041Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5960140Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5960142Z 2025-12-04T13:44:25.5960373Z [rank1]:[W1204 13:21:16.183415519 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5960543Z [rank3]:[W1204 13:21:16.522717290 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5960715Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5960974Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5961135Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5961501Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5961701Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5961807Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5961923Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5962020Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5962021Z 2025-12-04T13:44:25.5962255Z [rank3]:[W1204 13:21:16.524739025 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5962425Z [rank2]:[W1204 13:21:16.527876475 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5962600Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5962882Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5963044Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5963409Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5963610Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5963717Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5963811Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5963909Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5963911Z 2025-12-04T13:44:25.5964143Z [rank2]:[W1204 13:21:16.530154354 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5964311Z [rank1]:[W1204 13:21:17.183585989 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5964485Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5964739Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5964905Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5965268Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5965468Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5965573Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5965669Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5965784Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5965786Z 2025-12-04T13:44:25.5966016Z [rank1]:[W1204 13:21:17.184833471 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5966186Z [rank3]:[W1204 13:21:17.524918495 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5966360Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5966616Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5966805Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5967170Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5967370Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5967518Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5967617Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5967712Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5967714Z 2025-12-04T13:44:25.5967947Z [rank3]:[W1204 13:21:17.527314341 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5968117Z [rank2]:[W1204 13:21:17.530226726 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5968291Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5968546Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5968710Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5969079Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5969280Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5969384Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5969479Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5969578Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5969580Z 2025-12-04T13:44:25.5969839Z [rank2]:[W1204 13:21:17.532272951 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5970009Z [rank1]:[W1204 13:21:18.184999032 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5970183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5970437Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5970626Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5970993Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5971196Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5971301Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5971394Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5971493Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5971496Z 2025-12-04T13:44:25.5971730Z [rank1]:[W1204 13:21:18.186599076 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5971901Z [rank3]:[W1204 13:21:18.527493252 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5972076Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5972333Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5972495Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5972863Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5973065Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5973169Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5973265Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5973362Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5973366Z 2025-12-04T13:44:25.5973620Z [rank3]:[W1204 13:21:18.529771941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5973791Z [rank2]:[W1204 13:21:18.532385803 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5973965Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5974219Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5974382Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5974773Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5974973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5975077Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5975171Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5975268Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5975270Z 2025-12-04T13:44:25.5975506Z [rank2]:[W1204 13:21:18.534567564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5975677Z [rank1]:[W1204 13:21:19.186744718 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5975851Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5976103Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5976265Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5976635Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5976836Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5976941Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5977035Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5977133Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5977134Z 2025-12-04T13:44:25.5977366Z [rank1]:[W1204 13:21:19.188070159 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5977615Z [rank3]:[W1204 13:21:19.529920824 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5977789Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5978043Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5978207Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5978576Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5978807Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5978910Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5979007Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5979102Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5979104Z 2025-12-04T13:44:25.5979338Z [rank3]:[W1204 13:21:19.531977718 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5979509Z [rank2]:[W1204 13:21:19.534651578 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5979684Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5979941Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5980104Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5980470Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5980673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5980778Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5980874Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5980970Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5980972Z 2025-12-04T13:44:25.5981207Z [rank2]:[W1204 13:21:19.535988148 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5981378Z [rank1]:[W1204 13:21:20.188237111 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5981572Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5981824Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5981988Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5982355Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5982579Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5982684Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5982779Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5982876Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5982878Z 2025-12-04T13:44:25.5983110Z [rank1]:[W1204 13:21:20.189468284 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5983280Z [rank3]:[W1204 13:21:20.532142821 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5983457Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5983713Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5983875Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5984242Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5984446Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5984550Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5984647Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5984741Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5984743Z 2025-12-04T13:44:25.5984976Z [rank3]:[W1204 13:21:20.534067238 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5985146Z [rank2]:[W1204 13:21:20.536087723 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5985320Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5985596Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5985758Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5986124Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5986344Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5986451Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5986546Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5986643Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5986645Z 2025-12-04T13:44:25.5986879Z [rank2]:[W1204 13:21:20.538388192 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5987048Z [rank1]:[W1204 13:21:21.189645127 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5987226Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5987519Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5987682Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5988047Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5988248Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5988354Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5988450Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5988548Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5988549Z 2025-12-04T13:44:25.5988786Z [rank1]:[W1204 13:21:21.190888920 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5988955Z [rank3]:[W1204 13:21:21.534236302 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5989127Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5989409Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5989574Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5989937Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5990138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5990272Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5990370Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5990465Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5990467Z 2025-12-04T13:44:25.5990701Z [rank3]:[W1204 13:21:21.536851384 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5990872Z [rank2]:[W1204 13:21:21.538480787 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5991045Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5991301Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5991463Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5991832Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5992031Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5992138Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5992235Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5992331Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5992333Z 2025-12-04T13:44:25.5992568Z [rank2]:[W1204 13:21:21.540411814 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5992736Z [rank1]:[W1204 13:21:22.191047324 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5992912Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5993165Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5993358Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5993724Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5993925Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5994030Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5994147Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5994246Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5994247Z 2025-12-04T13:44:25.5994480Z [rank1]:[W1204 13:21:22.192292267 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5994650Z [rank3]:[W1204 13:21:22.537023258 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5994823Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5995081Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5995248Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5995612Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5995812Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5995915Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5996012Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5996108Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5996109Z 2025-12-04T13:44:25.5996346Z [rank3]:[W1204 13:21:22.539512413 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5996518Z [rank2]:[W1204 13:21:22.540506721 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5996691Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.5996945Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5997109Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5997542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5997743Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5997847Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5997943Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5998067Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5998069Z 2025-12-04T13:44:25.5998303Z [rank2]:[W1204 13:21:22.542032197 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.5998473Z [rank1]:[W1204 13:21:23.192452642 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.5998649Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.5998903Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.5999068Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5999437Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.5999639Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.5999744Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.5999838Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5999935Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.5999939Z 2025-12-04T13:44:25.6000173Z [rank1]:[W1204 13:21:23.194225653 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6000343Z [rank3]:[W1204 13:21:23.539722398 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6000516Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6000773Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6000937Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6001324Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6001530Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6001633Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6001728Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6001823Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6001844Z 2025-12-04T13:44:25.6002079Z [rank3]:[W1204 13:21:23.541382701 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6002250Z [rank2]:[W1204 13:21:23.542151653 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6002424Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6002680Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6002842Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6003212Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6003413Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6003518Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6003614Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6003710Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6003712Z 2025-12-04T13:44:25.6003945Z [rank2]:[W1204 13:21:23.544019462 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6004116Z [rank1]:[W1204 13:21:24.194375849 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6004291Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6004547Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6004709Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6005094Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6005296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6005400Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6005494Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6005592Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6005594Z 2025-12-04T13:44:25.6005826Z [rank1]:[W1204 13:21:24.195621802 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6006024Z [rank3]:[W1204 13:21:24.541544337 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6006198Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6006453Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6006615Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6006981Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6007186Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6007289Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6007386Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6007517Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6007519Z 2025-12-04T13:44:25.6007752Z [rank3]:[W1204 13:21:24.543762108 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6007925Z [rank2]:[W1204 13:21:24.544308376 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6008099Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6008352Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6008515Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6008884Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6009112Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6009219Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6009317Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6009413Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6009414Z 2025-12-04T13:44:25.6009649Z [rank2]:[W1204 13:21:24.546947507 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6009846Z [rank1]:[W1204 13:21:25.195754280 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6010021Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6010275Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6010438Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6010808Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6011012Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6011117Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6011212Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6011308Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6011310Z 2025-12-04T13:44:25.6011541Z [rank1]:[W1204 13:21:25.197084510 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6011711Z [rank3]:[W1204 13:21:25.543959175 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6011887Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6012142Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6012304Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6012669Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6012871Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6012998Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6013096Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6013192Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6013195Z 2025-12-04T13:44:25.6013429Z [rank3]:[W1204 13:21:25.545650397 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6013599Z [rank2]:[W1204 13:21:25.547036796 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6013793Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6014050Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6014213Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6014581Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6014784Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6014888Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6014987Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6015083Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6015084Z 2025-12-04T13:44:25.6015319Z [rank2]:[W1204 13:21:25.549135219 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6015488Z [rank1]:[W1204 13:21:26.197246548 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6015666Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6015919Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6016082Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6016446Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6016646Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6016752Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6016866Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6016964Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6016965Z 2025-12-04T13:44:25.6017197Z [rank1]:[W1204 13:21:26.198729105 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6017370Z [rank3]:[W1204 13:21:26.546026941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6017595Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6017875Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6018039Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6018404Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6018608Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6018715Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6018812Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6018907Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6018910Z 2025-12-04T13:44:25.6019143Z [rank3]:[W1204 13:21:26.548447417 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6019312Z [rank2]:[W1204 13:21:26.549239439 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6019486Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6019747Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6019910Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6020278Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6020480Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6020584Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6020680Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6020798Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6020800Z 2025-12-04T13:44:25.6021033Z [rank2]:[W1204 13:21:26.551274694 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6021202Z [rank1]:[W1204 13:21:27.198906814 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6021376Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6021629Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6021819Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6022188Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6022388Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6022493Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6022589Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6022687Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6022688Z 2025-12-04T13:44:25.6022920Z [rank1]:[W1204 13:21:27.200173696 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6023089Z [rank3]:[W1204 13:21:27.548636386 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6023264Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6023519Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6023685Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6024054Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6024258Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6024362Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6024459Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6024556Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6024559Z 2025-12-04T13:44:25.6024809Z [rank3]:[W1204 13:21:27.550498104 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6024981Z [rank2]:[W1204 13:21:27.551379234 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6025154Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6025410Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6025591Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6025963Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6026168Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6026273Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6026371Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6026468Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6026470Z 2025-12-04T13:44:25.6026705Z [rank2]:[W1204 13:21:27.553563336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6026875Z [rank1]:[W1204 13:21:28.200357665 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6027050Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6027304Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6027515Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6027883Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6028082Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6028187Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6028282Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6028379Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6028382Z 2025-12-04T13:44:25.6028639Z [rank1]:[W1204 13:21:28.201899421 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6028810Z [rank3]:[W1204 13:21:28.550655285 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6028983Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6029238Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6029402Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6029794Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6029995Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6030098Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6030194Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6030291Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6030295Z 2025-12-04T13:44:25.6030528Z [rank3]:[W1204 13:21:28.552880225 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6030701Z [rank2]:[W1204 13:21:28.553665007 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6030874Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6031129Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6031290Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6031661Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6031862Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6031966Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6032062Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6032158Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6032159Z 2025-12-04T13:44:25.6032394Z [rank2]:[W1204 13:21:28.555726641 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6032583Z [rank1]:[W1204 13:21:29.202050902 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6032761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6033016Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6033180Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6033546Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6033766Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6033871Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6033965Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6034061Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6034063Z 2025-12-04T13:44:25.6034294Z [rank1]:[W1204 13:21:29.203275275 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6034467Z [rank3]:[W1204 13:21:29.553085195 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6034642Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6034898Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6035062Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6035429Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6035633Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6035737Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6035832Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6035930Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6035931Z 2025-12-04T13:44:25.6036164Z [rank3]:[W1204 13:21:29.555091370 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6036336Z [rank2]:[W1204 13:21:29.555848114 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6036534Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6036789Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6036951Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6037319Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6037581Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6037684Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6037780Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6037876Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6037878Z 2025-12-04T13:44:25.6038111Z [rank2]:[W1204 13:21:29.558069204 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6038281Z [rank1]:[W1204 13:21:30.203514685 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6038456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6038708Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6038870Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6039235Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6039439Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6039544Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6039638Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6039735Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6039736Z 2025-12-04T13:44:25.6039968Z [rank1]:[W1204 13:21:30.205262306 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6040140Z [rank3]:[W1204 13:21:30.555280212 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6040316Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6040597Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6040761Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6041125Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6041354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6041459Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6041555Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6041653Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6041655Z 2025-12-04T13:44:25.6041886Z [rank3]:[W1204 13:21:30.557664229 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6042056Z [rank2]:[W1204 13:21:30.558179917 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6042231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6042488Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6042652Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6043019Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6043222Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6043327Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6043425Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6043520Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6043522Z 2025-12-04T13:44:25.6043757Z [rank2]:[W1204 13:21:30.560312270 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6043925Z [rank1]:[W1204 13:21:31.205423629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6044101Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6044375Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6044538Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6044905Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6045105Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6045230Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6045326Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6045423Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6045424Z 2025-12-04T13:44:25.6045656Z [rank1]:[W1204 13:21:31.206668531 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6045826Z [rank3]:[W1204 13:21:31.557832362 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6046002Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6046260Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6046423Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6046786Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6046987Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6047092Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6047189Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6047286Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6047288Z 2025-12-04T13:44:25.6047566Z [rank3]:[W1204 13:21:31.559232230 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6047737Z [rank2]:[W1204 13:21:31.560432444 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6047909Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6048170Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6048357Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6048725Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6048927Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6049044Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6049155Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6049253Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6049255Z 2025-12-04T13:44:25.6049489Z [rank2]:[W1204 13:21:31.562140556 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6049657Z [rank1]:[W1204 13:21:32.206819025 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6049830Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6050084Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6050249Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6050618Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6050818Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6050922Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6051020Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6051118Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6051121Z 2025-12-04T13:44:25.6051354Z [rank1]:[W1204 13:21:32.209085404 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6051523Z [rank3]:[W1204 13:21:32.559338286 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6051697Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6051950Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6052139Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6052507Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6052708Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6052812Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6052907Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6053025Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6053027Z 2025-12-04T13:44:25.6053260Z [rank3]:[W1204 13:21:32.561579176 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6053430Z [rank2]:[W1204 13:21:32.562247411 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6053601Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6053854Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6054018Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6054386Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6054589Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6054693Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6054789Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6054888Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6054890Z 2025-12-04T13:44:25.6055122Z [rank2]:[W1204 13:21:32.564248016 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6055291Z [rank1]:[W1204 13:21:33.209248539 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6055467Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6055720Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6055882Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6056264Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6056465Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6056571Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6056666Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6056763Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6056786Z 2025-12-04T13:44:25.6057021Z [rank1]:[W1204 13:21:33.210676687 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6057190Z [rank3]:[W1204 13:21:33.561756530 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6057365Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6057646Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6057809Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6058178Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6058379Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6058483Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6058578Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6058676Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6058679Z 2025-12-04T13:44:25.6058914Z [rank3]:[W1204 13:21:33.562999453 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6059089Z [rank2]:[W1204 13:21:33.564374062 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6059263Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6059521Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6059686Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6060086Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6060289Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6060392Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6060488Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6060584Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6060585Z 2025-12-04T13:44:25.6060818Z [rank2]:[W1204 13:21:33.566607822 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6061017Z [rank1]:[W1204 13:21:34.210849182 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6061193Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6061450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6061612Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6061982Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6062184Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6062290Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6062384Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6062482Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6062483Z 2025-12-04T13:44:25.6062717Z [rank1]:[W1204 13:21:34.212341229 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6062890Z [rank3]:[W1204 13:21:34.563166088 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6063066Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6063323Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6063488Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6063856Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6064082Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6064187Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6064283Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6064379Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6064381Z 2025-12-04T13:44:25.6064612Z [rank3]:[W1204 13:21:34.565283001 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6064813Z [rank2]:[W1204 13:21:34.566727899 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6064989Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6065243Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6065405Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6065778Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6065984Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6066087Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6066182Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6066279Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6066280Z 2025-12-04T13:44:25.6066513Z [rank2]:[W1204 13:21:34.569363930 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6066683Z [rank1]:[W1204 13:21:35.212493505 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6066859Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6067114Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6067275Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6067667Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6067898Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6068003Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6068098Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6068195Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6068197Z 2025-12-04T13:44:25.6068429Z [rank1]:[W1204 13:21:35.214095269 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6068598Z [rank3]:[W1204 13:21:35.565433538 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6068800Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6069055Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6069218Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6069582Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6069785Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6069891Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6069986Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6070083Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6070085Z 2025-12-04T13:44:25.6070316Z [rank3]:[W1204 13:21:35.567580480 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6070486Z [rank2]:[W1204 13:21:35.569466718 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6074325Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6074590Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6074754Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6075123Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6075330Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6075469Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6075567Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6075663Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6075665Z 2025-12-04T13:44:25.6075900Z [rank2]:[W1204 13:21:35.571809035 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6076070Z [rank1]:[W1204 13:21:36.214261037 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6076258Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6076535Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6076697Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6077064Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6077265Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6077372Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6077469Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6077604Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6077606Z 2025-12-04T13:44:25.6077840Z [rank1]:[W1204 13:21:36.216376709 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6078009Z [rank3]:[W1204 13:21:36.567774547 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6078184Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6078446Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6078610Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6078977Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6079180Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6079287Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6079382Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6079510Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6079512Z 2025-12-04T13:44:25.6079744Z [rank3]:[W1204 13:21:36.569887829 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6079914Z [rank2]:[W1204 13:21:36.571918044 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6080088Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6080371Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6080536Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6080903Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6081105Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6081209Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6081307Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6081405Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6081407Z 2025-12-04T13:44:25.6081639Z [rank2]:[W1204 13:21:36.573707654 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6081808Z [rank1]:[W1204 13:21:37.216587666 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6081983Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6082238Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6082403Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6082768Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6082970Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6083076Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6083172Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6083269Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6083289Z 2025-12-04T13:44:25.6083522Z [rank1]:[W1204 13:21:37.218800597 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6083691Z [rank3]:[W1204 13:21:37.570038198 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6083865Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6084120Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6084311Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6084678Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6084880Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6084986Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6085082Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6085180Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6085183Z 2025-12-04T13:44:25.6085417Z [rank3]:[W1204 13:21:37.571271061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6085587Z [rank2]:[W1204 13:21:37.573819674 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6085761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6086015Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6086179Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6086544Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6086746Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6086850Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6086946Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6087044Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6087046Z 2025-12-04T13:44:25.6087299Z [rank2]:[W1204 13:21:37.576056544 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6087470Z [rank1]:[W1204 13:21:38.218976856 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6087688Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6087942Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6088117Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6088495Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6088695Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6088800Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6088895Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6088992Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6088996Z 2025-12-04T13:44:25.6089229Z [rank1]:[W1204 13:21:38.220921122 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6089400Z [rank3]:[W1204 13:21:38.571435980 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6089575Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6089830Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6089990Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6090357Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6090559Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6090664Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6090758Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6090855Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6090859Z 2025-12-04T13:44:25.6091091Z [rank3]:[W1204 13:21:38.573567432 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6091284Z [rank2]:[W1204 13:21:38.578576080 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6091458Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6091715Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6091878Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6092266Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6092468Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6092571Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6092667Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6092762Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6092764Z 2025-12-04T13:44:25.6092997Z [rank2]:[W1204 13:21:38.580997646 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6093172Z [rank1]:[W1204 13:21:39.221096372 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6093345Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6093600Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6093762Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6094132Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6094335Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6094440Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6094534Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6094630Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6094632Z 2025-12-04T13:44:25.6094865Z [rank1]:[W1204 13:21:39.223147346 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6095054Z [rank3]:[W1204 13:21:39.573762032 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6095230Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6095487Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6095650Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6096022Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6096248Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6096353Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6096449Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6096545Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6096547Z 2025-12-04T13:44:25.6096778Z [rank3]:[W1204 13:21:39.575642900 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6096949Z [rank2]:[W1204 13:21:39.581110778 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6097124Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6097379Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6097588Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6097953Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6098161Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6098268Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6098364Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6098460Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6098462Z 2025-12-04T13:44:25.6098695Z [rank2]:[W1204 13:21:39.583031965 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6098866Z [rank1]:[W1204 13:21:40.223300177 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6099070Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6099326Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6099486Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6099852Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6100079Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6100184Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6100280Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6100376Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6100378Z 2025-12-04T13:44:25.6100611Z [rank1]:[W1204 13:21:40.225231844 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6100780Z [rank3]:[W1204 13:21:40.576960895 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6100957Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6101212Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6101375Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6101745Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6101946Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6102051Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6102146Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6102242Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6102244Z 2025-12-04T13:44:25.6102476Z [rank3]:[W1204 13:21:40.578628918 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6102647Z [rank2]:[W1204 13:21:40.583133667 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6102822Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6103102Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6103267Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6103632Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6103834Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6103961Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6104056Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6104152Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6104154Z 2025-12-04T13:44:25.6104387Z [rank2]:[W1204 13:21:40.585124003 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6104556Z [rank1]:[W1204 13:21:41.225375096 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6104729Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6104987Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6105148Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6105514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6105714Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6105821Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6105917Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6106012Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6106014Z 2025-12-04T13:44:25.6106247Z [rank1]:[W1204 13:21:41.226849543 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6106416Z [rank3]:[W1204 13:21:41.578801819 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6106590Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6106866Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6107029Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6107397Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6107636Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6107772Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6107866Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6107964Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6107966Z 2025-12-04T13:44:25.6108197Z [rank3]:[W1204 13:21:41.580985800 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6108367Z [rank2]:[W1204 13:21:41.585232016 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6108541Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6108798Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6108962Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6109330Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6109534Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6109637Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6109735Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6109832Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6109834Z 2025-12-04T13:44:25.6110068Z [rank2]:[W1204 13:21:41.587568284 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6110238Z [rank1]:[W1204 13:21:42.227042284 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6110411Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6110667Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6110857Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6111225Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6111427Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6111533Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6111651Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6111746Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6111747Z 2025-12-04T13:44:25.6111982Z [rank1]:[W1204 13:21:42.229421571 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6112153Z [rank3]:[W1204 13:21:42.581137953 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6112328Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6112584Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6112750Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6113118Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6113319Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6113424Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6113519Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6113618Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6113619Z 2025-12-04T13:44:25.6113853Z [rank3]:[W1204 13:21:42.583218097 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6114023Z [rank2]:[W1204 13:21:42.587665118 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6114196Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6114451Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6114616Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6115006Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6115208Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6115313Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6115408Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6115524Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6115526Z 2025-12-04T13:44:25.6115761Z [rank2]:[W1204 13:21:42.590114653 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6115931Z [rank1]:[W1204 13:21:43.229626454 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6116104Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6116359Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6116520Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6116889Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6117088Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6117194Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6117289Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6117383Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6117387Z 2025-12-04T13:44:25.6117643Z [rank1]:[W1204 13:21:43.231588400 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6117814Z [rank3]:[W1204 13:21:43.583307812 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6117989Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6118246Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6118411Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6118806Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6119007Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6119112Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6119207Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6119304Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6119318Z 2025-12-04T13:44:25.6119569Z [rank3]:[W1204 13:21:43.585214360 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6119743Z [rank2]:[W1204 13:21:43.590201488 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6119917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6120172Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6120335Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6120703Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6120905Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6121010Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6121107Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6121203Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6121206Z 2025-12-04T13:44:25.6121437Z [rank2]:[W1204 13:21:43.592639354 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6121611Z [rank1]:[W1204 13:21:44.231778793 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6121787Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6122043Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6122205Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6122591Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6122794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6122898Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6122993Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6123089Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6123091Z 2025-12-04T13:44:25.6123323Z [rank1]:[W1204 13:21:44.233629522 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6123519Z [rank3]:[W1204 13:21:44.585386214 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6123694Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6123951Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6124115Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6124481Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6124684Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6124788Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6124882Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6124979Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6124981Z 2025-12-04T13:44:25.6125213Z [rank3]:[W1204 13:21:44.587762461 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6125384Z [rank2]:[W1204 13:21:44.592731680 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6125559Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6125816Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6125981Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6126347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6126569Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6126673Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6126769Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6126865Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6126868Z 2025-12-04T13:44:25.6127099Z [rank2]:[W1204 13:21:44.595155876 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6127279Z [rank1]:[W1204 13:21:45.233770368 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6127503Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6127758Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6127920Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6128287Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6128493Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6128598Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6128694Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6128789Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6128791Z 2025-12-04T13:44:25.6129025Z [rank1]:[W1204 13:21:45.235010920 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6129195Z [rank3]:[W1204 13:21:45.587936056 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6129373Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6129633Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6129796Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6130167Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6130370Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6130501Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6130596Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6130692Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6130694Z 2025-12-04T13:44:25.6130926Z [rank3]:[W1204 13:21:45.589586369 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6131096Z [rank2]:[W1204 13:21:45.595294972 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6131295Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6131552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6131716Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6132084Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6132288Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6132395Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6132494Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6132591Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6132593Z 2025-12-04T13:44:25.6132825Z [rank2]:[W1204 13:21:45.596964934 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6132995Z [rank1]:[W1204 13:21:46.235094968 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6133167Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6133426Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6133588Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6133954Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6134155Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6134261Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6134376Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6134473Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6134474Z 2025-12-04T13:44:25.6134707Z [rank1]:[W1204 13:21:46.236999295 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6134875Z [rank3]:[W1204 13:21:46.589699696 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6135049Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6135329Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6135492Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6135858Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6136058Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6136166Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6136261Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6136359Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6136361Z 2025-12-04T13:44:25.6136595Z [rank3]:[W1204 13:21:46.591990625 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6136766Z [rank2]:[W1204 13:21:46.597063912 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6136940Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6137195Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6137360Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6137768Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6137970Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6138073Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6138173Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6138305Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6138308Z 2025-12-04T13:44:25.6138540Z [rank2]:[W1204 13:21:46.599385930 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6138710Z [rank1]:[W1204 13:21:47.237182831 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6138885Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6139143Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6139332Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6139697Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6139898Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6140002Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6140100Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6140196Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6140198Z 2025-12-04T13:44:25.6140433Z [rank1]:[W1204 13:21:47.239303154 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6140603Z [rank3]:[W1204 13:21:47.592158532 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6140778Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6141034Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6141198Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6141565Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6141766Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6141872Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6141967Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6142065Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6142067Z 2025-12-04T13:44:25.6142326Z [rank3]:[W1204 13:21:47.594702576 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6142497Z [rank2]:[W1204 13:21:47.599500299 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6142671Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6142925Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6143119Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6143489Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6143691Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6143795Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6143891Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6143989Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6143991Z 2025-12-04T13:44:25.6144225Z [rank2]:[W1204 13:21:47.601965674 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6144396Z [rank1]:[W1204 13:21:48.239369774 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6144569Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6144824Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6144985Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6145355Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6145559Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6145663Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6145758Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6145855Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6145858Z 2025-12-04T13:44:25.6146110Z [rank1]:[W1204 13:21:48.241477967 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6146280Z [rank3]:[W1204 13:21:48.595300744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6146455Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6146709Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6146873Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6147263Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6147463Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6147606Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6147701Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6147797Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6147799Z 2025-12-04T13:44:25.6148033Z [rank3]:[W1204 13:21:48.596887918 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6148206Z [rank2]:[W1204 13:21:48.602081883 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6148382Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6148635Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6148799Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6149168Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6149370Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6149473Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6149568Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6149667Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6149668Z 2025-12-04T13:44:25.6149902Z [rank2]:[W1204 13:21:48.604352202 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6150118Z [rank1]:[W1204 13:21:49.241541098 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6150293Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6150547Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6150708Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6151076Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6151310Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6151414Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6151509Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6151604Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6151606Z 2025-12-04T13:44:25.6151839Z [rank1]:[W1204 13:21:49.243434055 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6152014Z [rank3]:[W1204 13:21:49.597058697 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6152189Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6152445Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6152606Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6152971Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6153173Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6153278Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6153372Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6153468Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6153470Z 2025-12-04T13:44:25.6153701Z [rank3]:[W1204 13:21:49.598685321 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6153873Z [rank2]:[W1204 13:21:49.604442422 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6154068Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6154326Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6154489Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6154854Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6155079Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6155185Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6155280Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6155377Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6155379Z 2025-12-04T13:44:25.6155611Z [rank2]:[W1204 13:21:49.605592407 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6155782Z [rank1]:[W1204 13:21:50.243542136 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6155959Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6156215Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6156380Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6156746Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6156948Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6157052Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6157148Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6157244Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6157246Z 2025-12-04T13:44:25.6157503Z [rank1]:[W1204 13:21:50.245458263 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6157673Z [rank3]:[W1204 13:21:50.598856900 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6157850Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6158145Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6158308Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6158672Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6158900Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6159006Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6159101Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6159197Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6159199Z 2025-12-04T13:44:25.6159431Z [rank3]:[W1204 13:21:50.600503673 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6159600Z [rank2]:[W1204 13:21:50.605700337 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6159776Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6160031Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6160195Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6160562Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6160764Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6160870Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6160966Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6161063Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6161065Z 2025-12-04T13:44:25.6161298Z [rank2]:[W1204 13:21:50.608026746 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6161468Z [rank1]:[W1204 13:21:51.245876788 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6161641Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6161916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6162077Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6162445Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6162647Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6162769Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6162866Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6162962Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6162963Z 2025-12-04T13:44:25.6163196Z [rank1]:[W1204 13:21:51.248426381 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6163365Z [rank3]:[W1204 13:21:51.600667874 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6163539Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6163799Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6163962Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6164329Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6164530Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6164637Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6164732Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6164830Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6164832Z 2025-12-04T13:44:25.6165064Z [rank3]:[W1204 13:21:51.602838945 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6165233Z [rank2]:[W1204 13:21:51.608121358 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6165406Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6165659Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6165841Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6166208Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6166409Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6166515Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6166630Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6166730Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6166732Z 2025-12-04T13:44:25.6166969Z [rank2]:[W1204 13:21:51.609567025 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6167140Z [rank1]:[W1204 13:21:52.248547243 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6167314Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6167597Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6167763Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6168132Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6168334Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6168439Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6168536Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6168631Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6168633Z 2025-12-04T13:44:25.6168867Z [rank1]:[W1204 13:21:52.250664446 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6169039Z [rank3]:[W1204 13:21:52.602993858 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6169215Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6169474Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6169639Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6170031Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6170233Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6170338Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6170433Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6170555Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6170557Z 2025-12-04T13:44:25.6170793Z [rank3]:[W1204 13:21:52.604953984 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6170962Z [rank2]:[W1204 13:21:52.609722058 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6171138Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6171395Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6171560Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6171932Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6172134Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6172240Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6172335Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6172433Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6172435Z 2025-12-04T13:44:25.6172668Z [rank2]:[W1204 13:21:52.612004657 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6172839Z [rank1]:[W1204 13:21:53.250794769 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6173012Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6173267Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6173429Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6173823Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6174024Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6174128Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6174224Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6174319Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6174344Z 2025-12-04T13:44:25.6174579Z [rank1]:[W1204 13:21:53.252963161 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6174748Z [rank3]:[W1204 13:21:53.605149275 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6174922Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6175177Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6175337Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6175710Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6175909Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6176014Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6176109Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6176205Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6176207Z 2025-12-04T13:44:25.6176442Z [rank3]:[W1204 13:21:53.607540882 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6176612Z [rank2]:[W1204 13:21:53.612139570 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6176787Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6177039Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6177202Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6177636Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6177841Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6177946Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6178042Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6178139Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6178141Z 2025-12-04T13:44:25.6178372Z [rank2]:[W1204 13:21:53.614425249 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6178572Z [rank1]:[W1204 13:21:54.253099864 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6178745Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6178999Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6179161Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6179524Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6179732Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6179836Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6179932Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6180027Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6180029Z 2025-12-04T13:44:25.6180261Z [rank1]:[W1204 13:21:54.255429232 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6180433Z [rank3]:[W1204 13:21:54.607695175 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6180610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6180866Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6181027Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6181392Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6181614Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6181721Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6181817Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6181913Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6181915Z 2025-12-04T13:44:25.6182148Z [rank3]:[W1204 13:21:54.609781719 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6182338Z [rank2]:[W1204 13:21:54.614516303 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6182514Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6182772Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6182936Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6183306Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6183514Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6183619Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6183716Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6183813Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6183814Z 2025-12-04T13:44:25.6184046Z [rank2]:[W1204 13:21:54.616894400 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6184217Z [rank1]:[W1204 13:21:55.255635655 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6184394Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6184650Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6184812Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6185179Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6185382Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6185504Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6185601Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6185698Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6185700Z 2025-12-04T13:44:25.6185933Z [rank1]:[W1204 13:21:55.257651480 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6186103Z [rank3]:[W1204 13:21:55.609934043 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6186311Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6186570Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6186732Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6187100Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6187302Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6187408Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6187546Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6187643Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6187645Z 2025-12-04T13:44:25.6187878Z [rank3]:[W1204 13:21:55.611142396 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6188047Z [rank2]:[W1204 13:21:55.616985346 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6188225Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6188479Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6188642Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6189007Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6189210Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6189317Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6189439Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6189537Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6189539Z 2025-12-04T13:44:25.6189770Z [rank2]:[W1204 13:21:55.618137250 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6189941Z [rank1]:[W1204 13:21:56.257823444 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6190114Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6190398Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6190561Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6190925Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6191126Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6191232Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6191330Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6191426Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6191427Z 2025-12-04T13:44:25.6191661Z [rank1]:[W1204 13:21:56.259164664 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6191831Z [rank3]:[W1204 13:21:56.611317550 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6192004Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6192263Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6192426Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6192792Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6192993Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6193100Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6193195Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6193311Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6193313Z 2025-12-04T13:44:25.6193547Z [rank3]:[W1204 13:21:56.612553483 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6193717Z [rank2]:[W1204 13:21:56.618236226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6193893Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6194156Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6194341Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6194708Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6194909Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6195015Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6195112Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6195210Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6195212Z 2025-12-04T13:44:25.6195443Z [rank2]:[W1204 13:21:56.619477139 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6195613Z [rank1]:[W1204 13:21:57.259335709 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6195788Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6196045Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6196211Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6196576Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6196776Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6196880Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6196977Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6197073Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6197075Z 2025-12-04T13:44:25.6197328Z [rank1]:[W1204 13:21:57.260615451 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6197544Z [rank3]:[W1204 13:21:57.612725708 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6197721Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6197980Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6198166Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6198534Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6198735Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6198841Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6198937Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6199033Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6199035Z 2025-12-04T13:44:25.6199269Z [rank3]:[W1204 13:21:57.613961021 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6199438Z [rank2]:[W1204 13:21:57.619583315 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6199613Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6199866Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6200032Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6200401Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6200602Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6200708Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6200803Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6200900Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6200904Z 2025-12-04T13:44:25.6201160Z [rank2]:[W1204 13:21:57.620905026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6201332Z [rank1]:[W1204 13:21:58.260753637 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6201505Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6201761Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6201924Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6202324Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6202527Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6202631Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6202727Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6202823Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6202827Z 2025-12-04T13:44:25.6203060Z [rank1]:[W1204 13:21:58.262092947 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6203232Z [rank3]:[W1204 13:21:58.614118497 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6203405Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6203662Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6203823Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6204194Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6204396Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6204500Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6204596Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6204692Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6204694Z 2025-12-04T13:44:25.6204927Z [rank3]:[W1204 13:21:58.616384057 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6205122Z [rank2]:[W1204 13:21:58.621038763 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6205298Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6205552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6205714Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6206083Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6206305Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6206411Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6206507Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6206604Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6206606Z 2025-12-04T13:44:25.6206838Z [rank2]:[W1204 13:21:58.623186535 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6207013Z [rank1]:[W1204 13:21:59.262251324 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6207188Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6207444Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6207644Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6208010Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6208214Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6208317Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6208413Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6208508Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6208510Z 2025-12-04T13:44:25.6208745Z [rank1]:[W1204 13:21:59.264347988 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6208918Z [rank3]:[W1204 13:21:59.616540794 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6209120Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6209376Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6209538Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6209904Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6210131Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6210236Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6210332Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6210428Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6210429Z 2025-12-04T13:44:25.6210662Z [rank3]:[W1204 13:21:59.618915571 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6210835Z [rank2]:[W1204 13:21:59.623321743 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6211013Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6211266Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6211430Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6211798Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6212002Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6212107Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6212202Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6212299Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6212300Z 2025-12-04T13:44:25.6212532Z [rank2]:[W1204 13:21:59.624962416 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6212705Z [rank1]:[W1204 13:22:00.264460757 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6212898Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6213156Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6213319Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6213683Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6213906Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6214011Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6214107Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6214202Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6214206Z 2025-12-04T13:44:25.6214437Z [rank1]:[W1204 13:22:00.266473652 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6214608Z [rank3]:[W1204 13:22:00.619072150 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6214785Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6215046Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6215206Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6215573Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6215775Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6215881Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6215978Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6216074Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6216075Z 2025-12-04T13:44:25.6216307Z [rank3]:[W1204 13:22:00.621251701 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6216476Z [rank2]:[W1204 13:22:00.625072876 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6216651Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6216931Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6217098Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6217467Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6217701Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6217835Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6217931Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6218028Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6218030Z 2025-12-04T13:44:25.6218260Z [rank2]:[W1204 13:22:00.627074241 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6218431Z [rank1]:[W1204 13:22:01.266643060 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6218605Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6218862Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6219025Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6219392Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6219594Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6219699Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6219797Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6219892Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6219896Z 2025-12-04T13:44:25.6220126Z [rank1]:[W1204 13:22:01.268971528 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6220297Z [rank3]:[W1204 13:22:01.621592086 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6220470Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6220729Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6220916Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6221281Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6221483Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6221599Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6221705Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6221801Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6221803Z 2025-12-04T13:44:25.6222036Z [rank3]:[W1204 13:22:01.623410055 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6222204Z [rank2]:[W1204 13:22:01.627177291 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6222378Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6222631Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6222799Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6223166Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6223367Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6223472Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6223569Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6223666Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6223670Z 2025-12-04T13:44:25.6223901Z [rank2]:[W1204 13:22:01.628516412 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6224071Z [rank1]:[W1204 13:22:02.269127888 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6224246Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6224499Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6224682Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6225047Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6225249Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6225355Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6225451Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6225566Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6225570Z 2025-12-04T13:44:25.6225805Z [rank1]:[W1204 13:22:02.271408557 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6225977Z [rank3]:[W1204 13:22:02.623583365 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6226149Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6226405Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6226569Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6226939Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6227140Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6227244Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6227340Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6227437Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6227439Z 2025-12-04T13:44:25.6227690Z [rank3]:[W1204 13:22:02.625791476 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6227861Z [rank2]:[W1204 13:22:02.628623493 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6228036Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6228291Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6228455Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6228850Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6229051Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6229156Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6229251Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6229348Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6229379Z 2025-12-04T13:44:25.6229612Z [rank2]:[W1204 13:22:02.630616998 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6229782Z [rank1]:[W1204 13:22:03.271503959 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6229957Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6230212Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6230374Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6230743Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6230944Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6231048Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6231145Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6231240Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6231245Z 2025-12-04T13:44:25.6231477Z [rank1]:[W1204 13:22:03.272798700 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6231649Z [rank3]:[W1204 13:22:03.626131042 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6231823Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6232079Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6232240Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6232644Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6232845Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6232949Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6233045Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6233140Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6233142Z 2025-12-04T13:44:25.6233374Z [rank3]:[W1204 13:22:03.627739426 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6233564Z [rank2]:[W1204 13:22:03.630721380 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6233738Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6233991Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6234156Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6234523Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6234727Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6234831Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6234926Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6235023Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6235025Z 2025-12-04T13:44:25.6235256Z [rank2]:[W1204 13:22:03.631891224 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6235429Z [rank1]:[W1204 13:22:04.272995700 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6235605Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6235860Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6236024Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6236389Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6236612Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6236717Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6236815Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6236913Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6236914Z 2025-12-04T13:44:25.6237147Z [rank1]:[W1204 13:22:04.275373727 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6237337Z [rank3]:[W1204 13:22:04.627874259 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6237560Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6237816Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6237977Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6238342Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6238546Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6238651Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6238746Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6238841Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6238843Z 2025-12-04T13:44:25.6239077Z [rank3]:[W1204 13:22:04.630210626 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6239247Z [rank2]:[W1204 13:22:04.632012356 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6239422Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6239676Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6239838Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6240204Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6240455Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6240560Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6240655Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6240751Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6240753Z 2025-12-04T13:44:25.6240984Z [rank2]:[W1204 13:22:04.633752858 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6241156Z [rank1]:[W1204 13:22:05.275715645 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6241372Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6241627Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6241790Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6242156Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6242358Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6242463Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6242558Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6242654Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6242656Z 2025-12-04T13:44:25.6242887Z [rank1]:[W1204 13:22:05.277881697 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6243057Z [rank3]:[W1204 13:22:05.630957496 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6243231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6243490Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6243652Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6244016Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6244218Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6244342Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6244438Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6244532Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6244534Z 2025-12-04T13:44:25.6244768Z [rank3]:[W1204 13:22:05.632560670 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6244938Z [rank2]:[W1204 13:22:05.633883041 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6245124Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6245406Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6245570Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6245937Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6246137Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6246245Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6246340Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6246438Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6246440Z 2025-12-04T13:44:25.6246674Z [rank2]:[W1204 13:22:05.635927775 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6246844Z [rank1]:[W1204 13:22:06.278087519 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6247018Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6247274Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6247436Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6247851Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6248052Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6248158Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6248253Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6248383Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6248385Z 2025-12-04T13:44:25.6248617Z [rank1]:[W1204 13:22:06.279961687 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6248787Z [rank3]:[W1204 13:22:06.632732103 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6248960Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6249244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6249407Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6249775Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6249977Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6250082Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6250178Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6250275Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6250276Z 2025-12-04T13:44:25.6250512Z [rank3]:[W1204 13:22:06.634652070 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6250681Z [rank2]:[W1204 13:22:06.636032129 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6250855Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6251110Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6251276Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6251642Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6251843Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6251947Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6252044Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6252141Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6252162Z 2025-12-04T13:44:25.6252396Z [rank2]:[W1204 13:22:06.637357280 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6252565Z [rank1]:[W1204 13:22:07.280154390 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6252742Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6252998Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6253182Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6253547Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6253750Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6253855Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6253950Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6254047Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6254049Z 2025-12-04T13:44:25.6254283Z [rank1]:[W1204 13:22:07.282130986 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6254455Z [rank3]:[W1204 13:22:07.634804544 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6254628Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6254887Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6255050Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6255416Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6255617Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6255721Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6255817Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6255915Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6255916Z 2025-12-04T13:44:25.6256169Z [rank3]:[W1204 13:22:07.637188261 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6256339Z [rank2]:[W1204 13:22:07.637470435 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6256513Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6256768Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6256951Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6257321Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6257552Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6257658Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6257753Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6257850Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6257854Z 2025-12-04T13:44:25.6258089Z [rank2]:[W1204 13:22:07.639504600 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6258258Z [rank1]:[W1204 13:22:08.282316260 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6258432Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6258689Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6258856Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6259226Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6259429Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6259534Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6259629Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6259726Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6259730Z 2025-12-04T13:44:25.6259963Z [rank1]:[W1204 13:22:08.283958184 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6260160Z [rank3]:[W1204 13:22:08.637312027 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6260334Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6260590Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6260752Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6261151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6261353Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6261457Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6261553Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6261648Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6261650Z 2025-12-04T13:44:25.6261884Z [rank3]:[W1204 13:22:08.639633435 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6262058Z [rank2]:[W1204 13:22:08.639581176 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6262233Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6262488Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6262650Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6263016Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6263218Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6263323Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6263418Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6263517Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6263518Z 2025-12-04T13:44:25.6263754Z [rank2]:[W1204 13:22:08.640702221 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6263944Z [rank1]:[W1204 13:22:09.284172628 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6264121Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6264375Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6264537Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6264901Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6265122Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6265227Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6265322Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6265418Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6265420Z 2025-12-04T13:44:25.6265655Z [rank1]:[W1204 13:22:09.286668062 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6265829Z [rank3]:[W1204 13:22:09.639812180 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6266003Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6266260Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6266422Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6266789Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6266995Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6267099Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6267194Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6267289Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6267291Z 2025-12-04T13:44:25.6267555Z [rank3]:[W1204 13:22:09.642214257 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6267727Z [rank2]:[W1204 13:22:09.640829968 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6267928Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6268185Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6268347Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6268715Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6268942Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6269047Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6269143Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6269242Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6269243Z 2025-12-04T13:44:25.6269476Z [rank2]:[W1204 13:22:09.642903931 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6269646Z [rank1]:[W1204 13:22:10.286865197 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6269821Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6270078Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6270242Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6270607Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6270810Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6270918Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6271012Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6271108Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6271110Z 2025-12-04T13:44:25.6271344Z [rank1]:[W1204 13:22:10.289336812 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6271515Z [rank3]:[W1204 13:22:10.642379773 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6271692Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6271968Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6272131Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6272494Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6272695Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6272821Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6272917Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6273012Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6273013Z 2025-12-04T13:44:25.6273248Z [rank3]:[W1204 13:22:10.644673402 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6273418Z [rank2]:[W1204 13:22:10.643031669 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6273592Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6273853Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6274014Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6274380Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6274580Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6274686Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6274783Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6274879Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6274881Z 2025-12-04T13:44:25.6275114Z [rank2]:[W1204 13:22:10.645422815 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6275282Z [rank1]:[W1204 13:22:11.289482190 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6275456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6275740Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6275904Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6276271Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6276475Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6276599Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6276693Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6276792Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6276793Z 2025-12-04T13:44:25.6277029Z [rank1]:[W1204 13:22:11.291539564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6277204Z [rank3]:[W1204 13:22:11.644845469 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6277377Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6277657Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6277821Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6278187Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6278388Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6278491Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6278588Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6278685Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6278687Z 2025-12-04T13:44:25.6278922Z [rank3]:[W1204 13:22:11.646792006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6279092Z [rank2]:[W1204 13:22:11.645530004 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6279269Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6279525Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6279715Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6280084Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6280285Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6280391Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6280511Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6280608Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6280611Z 2025-12-04T13:44:25.6280844Z [rank2]:[W1204 13:22:11.647211156 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6281014Z [rank1]:[W1204 13:22:12.291682642 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6281189Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6281444Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6281610Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6281977Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6282178Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6282283Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6282378Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6282476Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6282478Z 2025-12-04T13:44:25.6282712Z [rank1]:[W1204 13:22:12.293451163 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6282883Z [rank2]:[W1204 13:22:12.647294706 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6283057Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6283315Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6283480Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6283868Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6284070Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6284173Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6284269Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6284385Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6284387Z 2025-12-04T13:44:25.6284621Z [rank2]:[W1204 13:22:12.649130995 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6284793Z [rank3]:[W1204 13:22:12.646974933 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6284967Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6285223Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6285385Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6285758Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6285960Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6286065Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6286160Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6286256Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6286259Z 2025-12-04T13:44:25.6286494Z [rank3]:[W1204 13:22:12.649150335 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6286663Z [rank1]:[W1204 13:22:13.293617711 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6286837Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6287089Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6287252Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6287677Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6287880Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6287987Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6288082Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6288178Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6288195Z 2025-12-04T13:44:25.6288441Z [rank1]:[W1204 13:22:13.295547598 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6288611Z [rank2]:[W1204 13:22:13.649262215 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6288786Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6289041Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6289204Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6289574Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6289779Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6289883Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6289979Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6290075Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6290076Z 2025-12-04T13:44:25.6290312Z [rank2]:[W1204 13:22:13.651041515 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6290486Z [rank3]:[W1204 13:22:13.649309114 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6290661Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6290915Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6291077Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6291467Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6291670Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6291775Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6291871Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6291966Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6291968Z 2025-12-04T13:44:25.6292200Z [rank3]:[W1204 13:22:13.651439656 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6292395Z [rank1]:[W1204 13:22:14.295667119 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6292572Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6292825Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6292988Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6293355Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6293559Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6293663Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6293758Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6293855Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6293857Z 2025-12-04T13:44:25.6294089Z [rank1]:[W1204 13:22:14.297842300 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6294260Z [rank2]:[W1204 13:22:14.651181655 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6294435Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6294694Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6294858Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6295223Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6295448Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6295553Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6295648Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6295744Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6295745Z 2025-12-04T13:44:25.6295979Z [rank2]:[W1204 13:22:14.653298998 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6296160Z [rank3]:[W1204 13:22:14.651588256 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6296345Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6296601Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6296764Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6297132Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6297346Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6297452Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6297585Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6297681Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6297682Z 2025-12-04T13:44:25.6297915Z [rank3]:[W1204 13:22:14.653793687 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6298085Z [rank1]:[W1204 13:22:15.297995370 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6298263Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6298518Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6298681Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6299051Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6299255Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6299388Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6299484Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6299580Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6299582Z 2025-12-04T13:44:25.6299814Z [rank1]:[W1204 13:22:15.299815840 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6299985Z [rank2]:[W1204 13:22:15.653380540 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6300186Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6300444Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6300608Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6300974Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6301178Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6301285Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6301383Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6301479Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6301481Z 2025-12-04T13:44:25.6301714Z [rank2]:[W1204 13:22:15.654530745 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6301885Z [rank3]:[W1204 13:22:15.654036866 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6302058Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6302316Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6302478Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6302846Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6303048Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6303155Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6303269Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6303365Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6303367Z 2025-12-04T13:44:25.6303602Z [rank3]:[W1204 13:22:15.655497133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6303770Z [rank1]:[W1204 13:22:16.299979471 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6303945Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6304229Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6304391Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6304758Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6304957Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6305066Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6305162Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6305259Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6305261Z 2025-12-04T13:44:25.6305493Z [rank1]:[W1204 13:22:16.301985386 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6305664Z [rank2]:[W1204 13:22:16.654640187 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6305837Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6306094Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6306266Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6306634Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6306836Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6306941Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6307038Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6307159Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6307161Z 2025-12-04T13:44:25.6307396Z [rank2]:[W1204 13:22:16.656269741 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6307597Z [rank3]:[W1204 13:22:16.655592916 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6307770Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6308028Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6308219Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6308589Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6308791Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6308898Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6308997Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6309093Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6309095Z 2025-12-04T13:44:25.6309339Z [rank3]:[W1204 13:22:16.657939814 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6309512Z [rank1]:[W1204 13:22:17.302202047 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6309687Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6309941Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6310106Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6310475Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6310676Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6310781Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6310876Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6310975Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6310977Z 2025-12-04T13:44:25.6311241Z [rank1]:[W1204 13:22:17.304313150 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6311413Z [rank2]:[W1204 13:22:17.656457963 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6311587Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6311847Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6312033Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6312401Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6312603Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6312708Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6312805Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6312903Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6312906Z 2025-12-04T13:44:25.6313140Z [rank2]:[W1204 13:22:17.658489067 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6313310Z [rank3]:[W1204 13:22:17.658068397 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6313483Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6313739Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6313903Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6314273Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6314476Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6314581Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6314676Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6314772Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6314775Z 2025-12-04T13:44:25.6315030Z [rank3]:[W1204 13:22:17.660060452 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6315200Z [rank1]:[W1204 13:22:18.304491772 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6315374Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6315627Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6315791Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6316181Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6316381Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6316488Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6316582Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6316678Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6316680Z 2025-12-04T13:44:25.6316913Z [rank1]:[W1204 13:22:18.306485068 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6317084Z [rank2]:[W1204 13:22:18.658609501 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6317259Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6317549Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6317711Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6318083Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6318288Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6318393Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6318490Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6318586Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6318589Z 2025-12-04T13:44:25.6318823Z [rank2]:[W1204 13:22:18.660914630 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6319020Z [rank3]:[W1204 13:22:18.660170247 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6319196Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6319451Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6319613Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6319982Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6320217Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6320321Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6320417Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6320513Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6320514Z 2025-12-04T13:44:25.6320749Z [rank3]:[W1204 13:22:18.662113543 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6320922Z [rank1]:[W1204 13:22:19.306649071 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6321098Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6325163Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6325336Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6325708Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6325917Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6326026Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6326122Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6326219Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6326221Z 2025-12-04T13:44:25.6326460Z [rank1]:[W1204 13:22:19.309205374 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6326635Z [rank2]:[W1204 13:22:19.661106553 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6326842Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6327101Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6327267Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6327673Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6327910Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6328015Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6328113Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6328209Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6328212Z 2025-12-04T13:44:25.6328447Z [rank2]:[W1204 13:22:19.663022531 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6328618Z [rank3]:[W1204 13:22:19.662255298 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6328798Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6329056Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6329218Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6329584Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6329787Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6329892Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6329989Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6330085Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6330087Z 2025-12-04T13:44:25.6330320Z [rank3]:[W1204 13:22:19.664310822 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6330488Z [rank1]:[W1204 13:22:20.309375468 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6330664Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6330947Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6331110Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6331476Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6331698Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6331804Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6331899Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6331996Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6331998Z 2025-12-04T13:44:25.6332231Z [rank1]:[W1204 13:22:20.311150899 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6332401Z [rank2]:[W1204 13:22:20.663185435 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6332576Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6332831Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6332995Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6333362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6333564Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6333669Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6333766Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6333862Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6333866Z 2025-12-04T13:44:25.6334098Z [rank2]:[W1204 13:22:20.664965935 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6334267Z [rank3]:[W1204 13:22:20.664440027 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6334439Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6334719Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6334881Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6335250Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6335451Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6335575Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6335673Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6335769Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6335771Z 2025-12-04T13:44:25.6336003Z [rank3]:[W1204 13:22:20.667132257 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6336172Z [rank1]:[W1204 13:22:21.311323214 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6336348Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6336605Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6336768Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6337133Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6337335Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6337443Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6337581Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6337679Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6337681Z 2025-12-04T13:44:25.6337914Z [rank1]:[W1204 13:22:21.312583896 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6338086Z [rank2]:[W1204 13:22:21.665138690 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6338260Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6338514Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6338712Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6339078Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6339280Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6339383Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6339505Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6339603Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6339606Z 2025-12-04T13:44:25.6339840Z [rank2]:[W1204 13:22:21.667818411 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6340009Z [rank3]:[W1204 13:22:21.667273473 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6340182Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6340438Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6340602Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6340967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6341168Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6341273Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6341370Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6341466Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6341468Z 2025-12-04T13:44:25.6341704Z [rank3]:[W1204 13:22:21.669384026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6341875Z [rank1]:[W1204 13:22:22.312719882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6342050Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6342303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6342466Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6342849Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6343049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6343155Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6343250Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6343366Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6343368Z 2025-12-04T13:44:25.6343601Z [rank1]:[W1204 13:22:22.314109631 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6343772Z [rank2]:[W1204 13:22:22.667951388 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6343946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6344201Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6344366Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6344734Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6344935Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6345038Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6345134Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6345232Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6345235Z 2025-12-04T13:44:25.6345468Z [rank2]:[W1204 13:22:22.670408573 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6345637Z [rank3]:[W1204 13:22:22.669498533 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6345811Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6346068Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6346232Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6346622Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6346823Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6346927Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6347023Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6347118Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6347139Z 2025-12-04T13:44:25.6347373Z [rank3]:[W1204 13:22:22.671746243 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6347584Z [rank1]:[W1204 13:22:23.314235839 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6347757Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6348010Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6348172Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6348544Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6348743Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6348848Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6348944Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6349040Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6349042Z 2025-12-04T13:44:25.6349276Z [rank1]:[W1204 13:22:23.316145316 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6349447Z [rank2]:[W1204 13:22:23.670565160 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6349621Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6349873Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6350034Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6350433Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6350641Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6350745Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6350841Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6350939Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6350941Z 2025-12-04T13:44:25.6351174Z [rank2]:[W1204 13:22:23.672407359 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6351370Z [rank3]:[W1204 13:22:23.671881201 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6351542Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6351797Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6351958Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6352322Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6352527Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6352631Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6352726Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6352822Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6352824Z 2025-12-04T13:44:25.6353058Z [rank3]:[W1204 13:22:23.674030183 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6353229Z [rank1]:[W1204 13:22:24.316323443 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6353404Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6353659Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6353820Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6354186Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6354413Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6354519Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6354614Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6354710Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6354712Z 2025-12-04T13:44:25.6354945Z [rank1]:[W1204 13:22:24.318699801 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6355137Z [rank2]:[W1204 13:22:24.672570097 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6355312Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6355567Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6355730Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6356096Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6356303Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6356407Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6356503Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6356601Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6356603Z 2025-12-04T13:44:25.6356837Z [rank2]:[W1204 13:22:24.674915885 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6357009Z [rank3]:[W1204 13:22:24.674167321 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6357188Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6357445Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6357647Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6358012Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6358214Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6358342Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6358438Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6358533Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6358535Z 2025-12-04T13:44:25.6358770Z [rank3]:[W1204 13:22:24.676702135 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6358939Z [rank1]:[W1204 13:22:25.318873599 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6359141Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6359400Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6359563Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6359932Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6360134Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6360240Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6360335Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6360431Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6360433Z 2025-12-04T13:44:25.6360664Z [rank1]:[W1204 13:22:25.320644029 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6360834Z [rank2]:[W1204 13:22:25.675102133 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6361010Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6361264Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6361428Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6361794Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6361996Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6362101Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6362215Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6362314Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6362315Z 2025-12-04T13:44:25.6362547Z [rank2]:[W1204 13:22:25.677151787 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6362718Z [rank3]:[W1204 13:22:25.676835794 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6362890Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6363170Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6363331Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6363700Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6363903Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6364009Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6364106Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6364201Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6364203Z 2025-12-04T13:44:25.6364436Z [rank3]:[W1204 13:22:25.678958127 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6364604Z [rank1]:[W1204 13:22:26.320800338 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6364778Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6365033Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6365196Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6365561Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6365762Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6365869Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6365963Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6366079Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6366081Z 2025-12-04T13:44:25.6366313Z [rank1]:[W1204 13:22:26.322194747 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6366483Z [rank2]:[W1204 13:22:26.677265288 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6366657Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6366911Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6367102Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6367466Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6367712Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6367818Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6367915Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6368013Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6368015Z 2025-12-04T13:44:25.6368246Z [rank2]:[W1204 13:22:26.679731772 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6368415Z [rank3]:[W1204 13:22:26.679052808 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6368588Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6368843Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6369007Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6369375Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6369577Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6369679Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6369778Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6369875Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6369877Z 2025-12-04T13:44:25.6370140Z [rank3]:[W1204 13:22:26.681022904 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6370312Z [rank1]:[W1204 13:22:27.322313758 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6370486Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6370741Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6370929Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6371295Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6371496Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6371600Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6371695Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6371794Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6371795Z 2025-12-04T13:44:25.6372034Z [rank1]:[W1204 13:22:27.323473212 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6372204Z [rank2]:[W1204 13:22:27.679846234 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6372379Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6372633Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6372797Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6373162Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6373363Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6373468Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6373563Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6373660Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6373664Z 2025-12-04T13:44:25.6373916Z [rank2]:[W1204 13:22:27.681476068 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6374089Z [rank3]:[W1204 13:22:27.681126445 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6374262Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6374517Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6374678Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6375070Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6375270Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6375373Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6375468Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6375563Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6375566Z 2025-12-04T13:44:25.6375801Z [rank3]:[W1204 13:22:27.683100521 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6375969Z [rank1]:[W1204 13:22:28.323801439 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6376144Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6376400Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6376562Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6376930Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6377130Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6377234Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6377329Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6377425Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6377427Z 2025-12-04T13:44:25.6377685Z [rank1]:[W1204 13:22:28.325337555 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6377885Z [rank2]:[W1204 13:22:28.681642958 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6378059Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6378312Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6378477Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6378858Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6379073Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6379177Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6379272Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6379369Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6379371Z 2025-12-04T13:44:25.6379602Z [rank2]:[W1204 13:22:28.683266042 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6379776Z [rank3]:[W1204 13:22:28.683257263 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6379949Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6380203Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6380365Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6380733Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6380937Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6381039Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6381135Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6381229Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6381231Z 2025-12-04T13:44:25.6381465Z [rank3]:[W1204 13:22:28.685325936 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6381636Z [rank1]:[W1204 13:22:29.325487087 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6381832Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6382088Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6382250Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6382618Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6382846Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6382951Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6383045Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6383142Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6383144Z 2025-12-04T13:44:25.6383377Z [rank1]:[W1204 13:22:29.326699600 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6383547Z [rank2]:[W1204 13:22:29.683433994 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6383722Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6383974Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6384138Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6384510Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6384719Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6384824Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6384920Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6385016Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6385018Z 2025-12-04T13:44:25.6385250Z [rank2]:[W1204 13:22:29.685178505 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6385420Z [rank3]:[W1204 13:22:29.685599966 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6385616Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6385872Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6386035Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6386402Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6386627Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6386733Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6386830Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6386926Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6386928Z 2025-12-04T13:44:25.6387162Z [rank3]:[W1204 13:22:29.687486394 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6387332Z [rank1]:[W1204 13:22:30.326897002 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6387546Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6387801Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6387963Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6388330Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6388533Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6388640Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6388736Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6388833Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6388835Z 2025-12-04T13:44:25.6389070Z [rank1]:[W1204 13:22:30.328634263 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6389238Z [rank2]:[W1204 13:22:30.685326968 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6389415Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6389694Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6389859Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6390223Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6390426Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6390555Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6390651Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6390748Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6390750Z 2025-12-04T13:44:25.6390984Z [rank2]:[W1204 13:22:30.687425581 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6391154Z [rank3]:[W1204 13:22:30.687608157 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6391328Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6391588Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6391751Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6392117Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6392320Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6392425Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6392523Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6392618Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6392620Z 2025-12-04T13:44:25.6392854Z [rank3]:[W1204 13:22:30.689476816 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6393025Z [rank1]:[W1204 13:22:31.328835045 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6393199Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6393457Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6393638Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6394005Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6394205Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6394330Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6394424Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6394523Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6394525Z 2025-12-04T13:44:25.6394757Z [rank1]:[W1204 13:22:31.330686554 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6394927Z [rank2]:[W1204 13:22:31.687638604 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6395100Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6395356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6395523Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6395890Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6396091Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6396195Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6396292Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6396389Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6396392Z 2025-12-04T13:44:25.6396625Z [rank2]:[W1204 13:22:31.689669568 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6396794Z [rank3]:[W1204 13:22:31.689633599 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6396968Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6397223Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6397415Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6397819Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6398021Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6398124Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6398238Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6398348Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6398350Z 2025-12-04T13:44:25.6398585Z [rank3]:[W1204 13:22:31.691568966 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6398759Z [rank1]:[W1204 13:22:32.330795329 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6398933Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6399187Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6399356Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6399724Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6399924Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6400031Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6400130Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6400229Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6400231Z 2025-12-04T13:44:25.6400468Z [rank1]:[W1204 13:22:32.332279576 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6400645Z [rank2]:[W1204 13:22:32.689825183 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6400819Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6401074Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6401242Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6401636Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6401845Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6401950Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6402045Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6402145Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6402170Z 2025-12-04T13:44:25.6402407Z [rank2]:[W1204 13:22:32.691896656 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6402578Z [rank3]:[W1204 13:22:32.691704471 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6402755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6403012Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6403179Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6403548Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6403755Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6403859Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6403958Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6404055Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6404058Z 2025-12-04T13:44:25.6404295Z [rank3]:[W1204 13:22:32.693261896 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6404469Z [rank1]:[W1204 13:22:33.332428551 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6404643Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6404898Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6405062Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6405452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6405654Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6405759Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6405858Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6405954Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6405956Z 2025-12-04T13:44:25.6406194Z [rank1]:[W1204 13:22:33.333983126 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6406386Z [rank2]:[W1204 13:22:33.692058121 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6406563Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6406821Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6406991Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6407365Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6407605Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6407709Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6407804Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6407902Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6407904Z 2025-12-04T13:44:25.6408139Z [rank2]:[W1204 13:22:33.694268312 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6408317Z [rank3]:[W1204 13:22:33.693385312 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6408495Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6408753Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6408915Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6409285Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6409512Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6409616Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6409712Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6409807Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6409809Z 2025-12-04T13:44:25.6410048Z [rank3]:[W1204 13:22:33.695379777 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6410249Z [rank1]:[W1204 13:22:34.334144002 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6410423Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6410681Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6410844Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6411215Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6411418Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6411523Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6411619Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6411715Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6411716Z 2025-12-04T13:44:25.6411953Z [rank1]:[W1204 13:22:34.336018810 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6412128Z [rank2]:[W1204 13:22:34.694405449 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6412303Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6412557Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6412719Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6413084Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6413312Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6413419Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6413514Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6413611Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6413613Z 2025-12-04T13:44:25.6413845Z [rank2]:[W1204 13:22:34.695947544 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6414015Z [rank3]:[W1204 13:22:34.695535073 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6414211Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6414472Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6414636Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6415004Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6415207Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6415311Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6415406Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6415502Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6415503Z 2025-12-04T13:44:25.6415736Z [rank3]:[W1204 13:22:34.697887881 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6415907Z [rank1]:[W1204 13:22:35.336199136 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6416081Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6416337Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6416499Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6416867Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6417068Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6417194Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6417291Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6417387Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6417389Z 2025-12-04T13:44:25.6417656Z [rank1]:[W1204 13:22:35.337863849 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6417824Z [rank2]:[W1204 13:22:35.696129390 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6418014Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6418282Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6418446Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6418813Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6419013Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6419120Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6419217Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6419314Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6419315Z 2025-12-04T13:44:25.6419549Z [rank2]:[W1204 13:22:35.698417009 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6419720Z [rank3]:[W1204 13:22:35.698060877 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6419893Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6420151Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6420316Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6420685Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6420890Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6420995Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6421118Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6421214Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6421217Z 2025-12-04T13:44:25.6421448Z [rank3]:[W1204 13:22:35.699924226 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6421618Z [rank1]:[W1204 13:22:36.338004976 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6421790Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6422068Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6422232Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6422599Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6422802Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6422907Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6423004Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6423101Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6423103Z 2025-12-04T13:44:25.6423337Z [rank1]:[W1204 13:22:36.339225609 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6423509Z [rank2]:[W1204 13:22:36.698593226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6423683Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6423937Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6424104Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6424472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6424674Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6424780Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6424876Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6424975Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6424997Z 2025-12-04T13:44:25.6425232Z [rank2]:[W1204 13:22:36.700793437 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6425402Z [rank3]:[W1204 13:22:36.700068633 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6425575Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6425829Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6426023Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6426386Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6426586Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6426690Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6426786Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6426882Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6426885Z 2025-12-04T13:44:25.6427118Z [rank3]:[W1204 13:22:36.701268637 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6427290Z [rank1]:[W1204 13:22:37.339431226 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6427463Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6427759Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6427922Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6428289Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6428490Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6428593Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6428690Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6428787Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6428789Z 2025-12-04T13:44:25.6429045Z [rank1]:[W1204 13:22:37.341658036 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6429215Z [rank2]:[W1204 13:22:37.700927086 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6429390Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6429647Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6429835Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6430202Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6430403Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6430508Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6430602Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6430699Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6430703Z 2025-12-04T13:44:25.6430937Z [rank2]:[W1204 13:22:37.702991430 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6431107Z [rank3]:[W1204 13:22:37.701397856 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6431281Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6431535Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6431699Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6432069Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6432269Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6432373Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6432469Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6432563Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6432568Z 2025-12-04T13:44:25.6432801Z [rank3]:[W1204 13:22:37.703627246 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6432989Z [rank1]:[W1204 13:22:38.341849794 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6433163Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6433419Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6433581Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6433977Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6434179Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6434282Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6434378Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6434473Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6434474Z 2025-12-04T13:44:25.6434708Z [rank1]:[W1204 13:22:38.344111784 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6434880Z [rank2]:[W1204 13:22:38.703163129 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6435055Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6435308Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6435472Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6435846Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6436050Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6436154Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6436249Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6436346Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6436348Z 2025-12-04T13:44:25.6436580Z [rank2]:[W1204 13:22:38.705113445 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6436771Z [rank3]:[W1204 13:22:38.703742926 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6436947Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6437203Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6437366Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6437761Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6437993Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6438097Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6438193Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6438289Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6438291Z 2025-12-04T13:44:25.6438523Z [rank3]:[W1204 13:22:38.705978646 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6438695Z [rank1]:[W1204 13:22:39.344301003 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6438869Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6439124Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6439285Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6439650Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6439855Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6439959Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6440054Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6440149Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6440151Z 2025-12-04T13:44:25.6440385Z [rank1]:[W1204 13:22:39.346476784 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6440556Z [rank2]:[W1204 13:22:39.705257605 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6440756Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6441010Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6441175Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6441542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6441774Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6441879Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6441973Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6442071Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6442073Z 2025-12-04T13:44:25.6442305Z [rank2]:[W1204 13:22:39.707638572 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6442476Z [rank3]:[W1204 13:22:39.706084917 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6442652Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6442908Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6443072Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6443438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6443642Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6443748Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6443844Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6443942Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6443943Z 2025-12-04T13:44:25.6444175Z [rank3]:[W1204 13:22:39.708345897 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6444347Z [rank1]:[W1204 13:22:40.346594465 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6444522Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6444801Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6444963Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6445329Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6445542Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6445656Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6445752Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6445847Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6445849Z 2025-12-04T13:44:25.6446082Z [rank1]:[W1204 13:22:40.347760949 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6446250Z [rank2]:[W1204 13:22:40.707808493 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6446425Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6446685Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6446854Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6447224Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6447427Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6447577Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6447673Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6447772Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6447773Z 2025-12-04T13:44:25.6447947Z [rank3]:[W1204 13:22:40.708456378 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6448121Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6448379Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6448546Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6448946Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6449150Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6449255Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6449350Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6449473Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6449475Z 2025-12-04T13:44:25.6449713Z [rank2]:[W1204 13:22:40.710920023 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6449947Z [rank3]:[W1204 13:22:40.710922263 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6450125Z [rank1]:[W1204 13:22:41.347923570 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6450302Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6450563Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6450729Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6451099Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6451301Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6451405Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6451505Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6451602Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6451604Z 2025-12-04T13:44:25.6451840Z [rank1]:[W1204 13:22:41.349274900 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6452011Z [rank2]:[W1204 13:22:41.711051545 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6452185Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6452440Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6452624Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6452997Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6453199Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6453304Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6453420Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6453519Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6453522Z 2025-12-04T13:44:25.6453692Z [rank3]:[W1204 13:22:41.711065045 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6453868Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6454123Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6454284Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6454653Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6454856Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6454962Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6455057Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6455154Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6455156Z 2025-12-04T13:44:25.6455392Z [rank2]:[W1204 13:22:41.712769007 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6455622Z [rank3]:[W1204 13:22:41.712769087 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6455793Z [rank1]:[W1204 13:22:42.349674967 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6455968Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6456227Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6456394Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6456795Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6456997Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6457101Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6457196Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6457314Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6457316Z 2025-12-04T13:44:25.6457582Z [rank1]:[W1204 13:22:42.351907307 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6457756Z [rank2]:[W1204 13:22:42.712888080 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6457931Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6458188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6458354Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6458724Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6458925Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6459030Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6459125Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6459222Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6459225Z 2025-12-04T13:44:25.6459399Z [rank3]:[W1204 13:22:42.712888060 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6459574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6459834Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6459999Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6460374Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6460611Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6460717Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6460812Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6460910Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6460912Z 2025-12-04T13:44:25.6461149Z [rank2]:[W1204 13:22:42.715071771 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6461407Z [rank3]:[W1204 13:22:42.715071691 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6461582Z [rank1]:[W1204 13:22:43.352046380 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6461759Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6462016Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6462178Z frame #1: + 
0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6462551Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6462755Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6462858Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6462956Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6463051Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6463053Z 2025-12-04T13:44:25.6463288Z [rank1]:[W1204 13:22:43.353274133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6463463Z [rank2]:[W1204 13:22:43.715205895 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6463640Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6463899Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6464063Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6464450Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 
(0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6464653Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6464758Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6464853Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6464952Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6464953Z 2025-12-04T13:44:25.6465188Z [rank2]:[W1204 13:22:43.717151642 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6465378Z [rank3]:[W1204 13:22:43.715383891 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6465554Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6465813Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6465978Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6466346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6466552Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6466658Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6466753Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6466849Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6466851Z 2025-12-04T13:44:25.6467083Z [rank3]:[W1204 13:22:43.718017362 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6467257Z [rank1]:[W1204 13:22:44.353389127 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6467430Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6467724Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6467886Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6468255Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6468486Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6468591Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6468687Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6468782Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6468784Z 2025-12-04T13:44:25.6469017Z [rank1]:[W1204 13:22:44.354771726 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6469199Z [rank2]:[W1204 13:22:44.717310755 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6469388Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6469643Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6469806Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6470176Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6470381Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6470486Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6470580Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6470677Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6470679Z 2025-12-04T13:44:25.6470911Z [rank2]:[W1204 13:22:44.719244472 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6471080Z [rank3]:[W1204 13:22:44.718126387 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6471257Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6471512Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6471675Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6472038Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6472242Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6472372Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6472468Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6472565Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6472567Z 2025-12-04T13:44:25.6472800Z [rank3]:[W1204 13:22:44.720199071 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6472970Z [rank1]:[W1204 13:22:45.354936060 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6473163Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6473419Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6473580Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6473948Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6474150Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6474257Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6474354Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6474450Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6474452Z 2025-12-04T13:44:25.6474685Z [rank1]:[W1204 13:22:45.356241951 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6474855Z [rank2]:[W1204 13:22:45.719356727 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6475030Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6475288Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6475450Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6475818Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6476018Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6476124Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6476238Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6476336Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6476338Z 2025-12-04T13:44:25.6476571Z [rank2]:[W1204 13:22:45.720674718 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6476742Z [rank3]:[W1204 13:22:45.720326386 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6476916Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6477193Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6477356Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6477755Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6477958Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6478065Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6478160Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6478257Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6478259Z 2025-12-04T13:44:25.6478490Z [rank3]:[W1204 13:22:45.721503890 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6478660Z [rank1]:[W1204 13:22:46.356444385 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6478835Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6479094Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6479261Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6479626Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6479828Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6479931Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6480028Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6480148Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6480150Z 2025-12-04T13:44:25.6480383Z [rank1]:[W1204 13:22:46.358614297 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6480554Z [rank2]:[W1204 13:22:46.720808064 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6480728Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6480987Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6481179Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6481547Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6481749Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6481854Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6481951Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6482049Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6482051Z 2025-12-04T13:44:25.6482286Z [rank2]:[W1204 13:22:46.722044336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6482454Z [rank3]:[W1204 13:22:46.721605866 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6482627Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6482882Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6483048Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6483419Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6483620Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6483725Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6483820Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6483918Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6483921Z 2025-12-04T13:44:25.6484172Z [rank3]:[W1204 13:22:46.722772250 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6484342Z [rank1]:[W1204 13:22:47.358777132 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6484515Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6484770Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6484959Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6485331Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6485534Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6485639Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6485736Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6485833Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6485835Z 2025-12-04T13:44:25.6486069Z [rank1]:[W1204 13:22:47.361023762 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6486239Z [rank2]:[W1204 13:22:47.722107474 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6486412Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6486668Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6486831Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6487201Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6487402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6487544Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6487641Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6487741Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6487744Z 2025-12-04T13:44:25.6488005Z [rank2]:[W1204 13:22:47.723257479 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6488173Z [rank3]:[W1204 13:22:47.722853268 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6488346Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6488600Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6488763Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6489153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6489354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6489459Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6489554Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6489652Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6489654Z 2025-12-04T13:44:25.6489891Z [rank3]:[W1204 13:22:47.724388604 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6490062Z [rank1]:[W1204 13:22:48.361211628 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6490235Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6490488Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6490653Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6491019Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6491220Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6491323Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6491419Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6491514Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6491516Z 2025-12-04T13:44:25.6491749Z [rank1]:[W1204 13:22:48.363247343 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6491943Z [rank2]:[W1204 13:22:48.723410816 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6492118Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6492374Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6492535Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6492902Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6493123Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6493227Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6493322Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6493418Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6493420Z 2025-12-04T13:44:25.6493653Z [rank2]:[W1204 13:22:48.725193726 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6493825Z [rank3]:[W1204 13:22:48.724507161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6493999Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6494256Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6494422Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6494787Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6494991Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6495096Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6495192Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6495289Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6495291Z 2025-12-04T13:44:25.6495523Z [rank3]:[W1204 13:22:48.725895380 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6495694Z [rank1]:[W1204 13:22:49.363396600 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6495886Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6496144Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6496309Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6496673Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6496899Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6497002Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6497097Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6497192Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6497194Z 2025-12-04T13:44:25.6497428Z [rank1]:[W1204 13:22:49.364797469 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6497646Z [rank2]:[W1204 13:22:49.725339084 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6497823Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6498080Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6498243Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6498615Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6498817Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6498923Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6499019Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6499115Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6499117Z 2025-12-04T13:44:25.6499350Z [rank2]:[W1204 13:22:49.726671204 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6499518Z [rank3]:[W1204 13:22:49.726014829 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6499693Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6499973Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6500137Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6500509Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6500742Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6500851Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6500947Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6501043Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6501045Z 2025-12-04T13:44:25.6501278Z [rank3]:[W1204 13:22:49.727261911 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6501449Z [rank1]:[W1204 13:22:50.364999586 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6501624Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6501878Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6502040Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6502404Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6502607Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6502711Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6502809Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6502904Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6502905Z 2025-12-04T13:44:25.6503141Z [rank1]:[W1204 13:22:50.366467163 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6503312Z [rank2]:[W1204 13:22:50.726810783 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6503485Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6503765Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6503927Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6504293Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6504494Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6504620Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6504717Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6504814Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6504816Z 2025-12-04T13:44:25.6505050Z [rank2]:[W1204 13:22:50.728245291 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6505219Z [rank3]:[W1204 13:22:50.727422269 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6505392Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6505650Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6505815Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6506180Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6506382Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6506487Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6506581Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6506684Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6506686Z 2025-12-04T13:44:25.6506918Z [rank3]:[W1204 13:22:50.729477803 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6507089Z [rank1]:[W1204 13:22:51.366664101 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6507263Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6507564Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6507754Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6508122Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6508322Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6508425Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6508547Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6508644Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6508646Z 2025-12-04T13:44:25.6508880Z [rank1]:[W1204 13:22:51.368597098 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6509049Z [rank2]:[W1204 13:22:51.728364501 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6509221Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6509476Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6509642Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6510011Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6510213Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6510318Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6510415Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6510512Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6510514Z 2025-12-04T13:44:25.6510748Z [rank2]:[W1204 13:22:51.730266978 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6510918Z [rank3]:[W1204 13:22:51.729617723 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6511094Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6511349Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6511514Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6511899Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6512099Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6512204Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6512300Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6512418Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6512420Z 2025-12-04T13:44:25.6512654Z [rank3]:[W1204 13:22:51.731795884 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6512825Z [rank1]:[W1204 13:22:52.368756857 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6512998Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6513253Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6513417Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6513783Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6513985Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6514088Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6514184Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6514281Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6514285Z 2025-12-04T13:44:25.6514522Z [rank1]:[W1204 13:22:52.370022049 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6514693Z [rank2]:[W1204 13:22:52.730383319 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6514868Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6515122Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6515285Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6515678Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6515883Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6515987Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6516083Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6516179Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6516202Z 2025-12-04T13:44:25.6516438Z [rank2]:[W1204 13:22:52.732571470 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6516608Z [rank3]:[W1204 13:22:52.731894395 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6516783Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6517037Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6517202Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6517606Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6517805Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6517909Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6518003Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6518099Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6518102Z 2025-12-04T13:44:25.6518335Z [rank3]:[W1204 13:22:52.733694925 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6518506Z [rank1]:[W1204 13:22:53.370194549 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6518680Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6518935Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6519098Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6519491Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6519693Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6519796Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6519893Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6519989Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6519992Z 2025-12-04T13:44:25.6520225Z [rank1]:[W1204 13:22:53.371444451 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6520422Z [rank2]:[W1204 13:22:53.732721771 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6520595Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6520851Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6521014Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6521381Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6521588Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6521691Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6521787Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6521883Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6521885Z 2025-12-04T13:44:25.6522117Z [rank2]:[W1204 13:22:53.735044939 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6522287Z [rank3]:[W1204 13:22:53.733829426 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6522462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6522716Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6522878Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6523251Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6523473Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6523579Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6523673Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6523771Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6523772Z 2025-12-04T13:44:25.6524007Z [rank3]:[W1204 13:22:53.735764723 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6524196Z [rank1]:[W1204 13:22:54.371591653 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6524371Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6524624Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6524786Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6525151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6525357Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6525462Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6525558Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6525654Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6525657Z 2025-12-04T13:44:25.6525891Z [rank1]:[W1204 13:22:54.373767634 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6526062Z [rank2]:[W1204 13:22:54.735423285 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6526238Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6526493Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6526654Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6527024Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6527228Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6527350Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6527448Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6527582Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6527584Z 2025-12-04T13:44:25.6527820Z [rank2]:[W1204 13:22:54.737900030 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6527990Z [rank3]:[W1204 13:22:54.735899685 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6528192Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6528446Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6528610Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6528978Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6529181Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6529288Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6529382Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6529478Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6529481Z 2025-12-04T13:44:25.6529713Z [rank3]:[W1204 13:22:54.738158655 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6529883Z [rank1]:[W1204 13:22:55.373891447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6530059Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6530312Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6530476Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6530840Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6531042Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6531148Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6531276Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6531373Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6531376Z 2025-12-04T13:44:25.6531610Z [rank1]:[W1204 13:22:55.375135829 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6531779Z [rank3]:[W1204 13:22:55.738280497 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6531952Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6532230Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6532392Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6532758Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6532959Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6533064Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6533160Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6533256Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6533258Z 2025-12-04T13:44:25.6533492Z [rank3]:[W1204 13:22:55.739535970 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6533664Z [rank2]:[W1204 13:22:55.738080212 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6533839Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6534094Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6534257Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6534623Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6534824Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6534931Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6535025Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6535141Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6535143Z 2025-12-04T13:44:25.6535375Z [rank2]:[W1204 13:22:55.740427430 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6535545Z [rank1]:[W1204 13:22:56.375706912 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6535723Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6535977Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6536163Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6536530Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6536732Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6536835Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6536933Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6537030Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6537032Z 2025-12-04T13:44:25.6537265Z [rank1]:[W1204 13:22:56.377677399 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6537435Z [rank3]:[W1204 13:22:56.739703062 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6537643Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6537898Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6538062Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6538428Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6538628Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6538731Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6538829Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6538927Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6538929Z 2025-12-04T13:44:25.6539188Z [rank3]:[W1204 13:22:56.741636209 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6539358Z [rank2]:[W1204 13:22:56.740534364 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6539533Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6539787Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6539979Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6540350Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6540551Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6540656Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6540750Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6540849Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6540851Z 2025-12-04T13:44:25.6541084Z [rank2]:[W1204 13:22:56.742727225 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6541258Z [rank1]:[W1204 13:22:57.377851171 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6541434Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6541686Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6541850Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6542221Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6542424Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6542528Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6542624Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6542721Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6542724Z 2025-12-04T13:44:25.6542976Z [rank1]:[W1204 13:22:57.379524284 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6543147Z [rank3]:[W1204 13:22:57.741807202 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6543322Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6543577Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6543738Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6544139Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6544340Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6544443Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6544542Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6544638Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6544641Z 2025-12-04T13:44:25.6544878Z [rank3]:[W1204 13:22:57.743198791 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6545047Z [rank2]:[W1204 13:22:57.742822990 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6545222Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6545474Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6545640Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6546010Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6546211Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6546318Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6546415Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6546512Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6546514Z 2025-12-04T13:44:25.6546750Z [rank2]:[W1204 13:22:57.745057020 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6546941Z [rank1]:[W1204 13:22:58.379678178 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6547116Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6547368Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6547580Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6547962Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6548180Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6548284Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6548380Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6548476Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6548478Z 2025-12-04T13:44:25.6548712Z [rank1]:[W1204 13:22:58.380914161 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6548886Z [rank3]:[W1204 13:22:58.743364825 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6549060Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6549316Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6549477Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6549845Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6550050Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6550153Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6550249Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6550345Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6550346Z 2025-12-04T13:44:25.6550581Z [rank3]:[W1204 13:22:58.745299462 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6550753Z [rank2]:[W1204 13:22:58.745191125 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6550954Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6551212Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6551374Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6551741Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6551967Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6552073Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6552169Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6552266Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6552268Z 2025-12-04T13:44:25.6552501Z [rank2]:[W1204 13:22:58.747339937 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6552674Z [rank1]:[W1204 13:22:59.381052926 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6552850Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6553106Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6553268Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6553633Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6553837Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6553942Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6554037Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6554133Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6554135Z 2025-12-04T13:44:25.6554367Z [rank1]:[W1204 13:22:59.382302548 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6554536Z [rank3]:[W1204 13:22:59.745505576 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6554729Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6554987Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6555149Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6555518Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6555746Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6555850Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6555947Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6556042Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6556044Z 2025-12-04T13:44:25.6556278Z [rank3]:[W1204 13:22:59.747436393 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6556446Z [rank2]:[W1204 13:22:59.747456183 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6556623Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6556879Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6557043Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6557414Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6557655Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6557760Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6557856Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6557953Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6557955Z 2025-12-04T13:44:25.6558192Z [rank2]:[W1204 13:22:59.749104366 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6558362Z [rank1]:[W1204 13:23:00.382434314 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6558538Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6558817Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6558979Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6559346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6559548Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6559689Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6559785Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6559882Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6559884Z 2025-12-04T13:44:25.6560115Z [rank1]:[W1204 13:23:00.384043258 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6560288Z [rank3]:[W1204 13:23:00.747633728 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6560461Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6560720Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6560882Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6561249Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6561455Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6561560Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6561657Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6561752Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6561754Z 2025-12-04T13:44:25.6561987Z [rank3]:[W1204 13:23:00.749605704 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6562156Z [rank2]:[W1204 13:23:00.749230493 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6562329Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6562584Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6562767Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6563136Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6563337Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6563461Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6563557Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6563656Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6563657Z 2025-12-04T13:44:25.6563891Z [rank2]:[W1204 13:23:00.751065622 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6564060Z [rank1]:[W1204 13:23:01.384168915 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6564234Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6564487Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6564653Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6565018Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6565221Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6565325Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6565421Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6565519Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6565522Z 2025-12-04T13:44:25.6565758Z [rank1]:[W1204 13:23:01.385627773 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6565930Z [rank3]:[W1204 13:23:01.749799050 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6566103Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6566359Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6566543Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6566910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6567111Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6567214Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6567322Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6567428Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6567430Z 2025-12-04T13:44:25.6567702Z [rank3]:[W1204 13:23:01.751920133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6567872Z [rank2]:[W1204 13:23:01.751188369 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6568049Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6568305Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6568469Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6568836Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6569037Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6569141Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6569236Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6569336Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6569338Z 2025-12-04T13:44:25.6569573Z [rank2]:[W1204 13:23:01.754066215 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6569742Z [rank1]:[W1204 13:23:02.385775470 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6569917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6570175Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6570339Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6570730Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6570933Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6571037Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6571132Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6571229Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6571256Z 2025-12-04T13:44:25.6571489Z [rank1]:[W1204 13:23:02.387014903 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6571658Z [rank3]:[W1204 13:23:02.752101380 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6571831Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6572088Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6572252Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6576271Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6576474Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6576579Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6576676Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6576772Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6576778Z 2025-12-04T13:44:25.6577013Z [rank3]:[W1204 13:23:02.754165074 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6577184Z [rank2]:[W1204 13:23:02.754150884 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6577358Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6577648Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6577811Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6578230Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6578432Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6578540Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6578635Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6578733Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6578735Z 2025-12-04T13:44:25.6578968Z [rank2]:[W1204 13:23:02.756583110 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6579170Z [rank1]:[W1204 13:23:03.387119361 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6579346Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6579600Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6579764Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6580133Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6580336Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6580441Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6580536Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6580633Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6580635Z 2025-12-04T13:44:25.6580867Z [rank1]:[W1204 13:23:03.388289395 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6581041Z [rank3]:[W1204 13:23:03.754280693 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6581215Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6581472Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6581634Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6582007Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6582231Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6582337Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6582433Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6582528Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6582530Z 2025-12-04T13:44:25.6582765Z [rank3]:[W1204 13:23:03.756243779 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6582957Z [rank2]:[W1204 13:23:03.756675549 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6583134Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6583390Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6583552Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6583922Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6584128Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6584234Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6584330Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6584428Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6584430Z 2025-12-04T13:44:25.6584664Z [rank2]:[W1204 13:23:03.758473839 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6584835Z [rank1]:[W1204 13:23:04.388451924 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6585011Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6585265Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6585429Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6585793Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6586014Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6586119Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6586217Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6586316Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6586318Z 2025-12-04T13:44:25.6586552Z [rank1]:[W1204 13:23:04.389926041 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6586723Z [rank3]:[W1204 13:23:04.756415747 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6586918Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6587175Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6587337Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6587752Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6587957Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6588062Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6588158Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6588253Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6588255Z 2025-12-04T13:44:25.6588489Z [rank3]:[W1204 13:23:04.758587569 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6588659Z [rank2]:[W1204 13:23:04.758544360 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6588837Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6589093Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6589257Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6589627Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6589830Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6589960Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6590055Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6590152Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6590154Z 2025-12-04T13:44:25.6590386Z [rank2]:[W1204 13:23:04.760698202 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6590555Z [rank1]:[W1204 13:23:05.390154469 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6590755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6591013Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6591176Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6591543Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6591747Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6591854Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6591950Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6592047Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6592049Z 2025-12-04T13:44:25.6592282Z [rank1]:[W1204 13:23:05.392581915 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6592453Z [rank3]:[W1204 13:23:05.758753388 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6592626Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6592885Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6593047Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6593417Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6593619Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6593725Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6593845Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6593941Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6593943Z 2025-12-04T13:44:25.6594176Z [rank3]:[W1204 13:23:05.760995358 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6594345Z [rank2]:[W1204 13:23:05.760770224 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6594520Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6594798Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6594961Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6595330Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6595533Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6595640Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6595737Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6595835Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6595837Z 2025-12-04T13:44:25.6596072Z [rank2]:[W1204 13:23:05.762920996 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6596241Z [rank1]:[W1204 13:23:06.392775584 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6596415Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6596668Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6596834Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6597199Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6597402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6597550Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6597647Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6597743Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6597771Z 2025-12-04T13:44:25.6598004Z [rank1]:[W1204 13:23:06.395206150 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6598176Z [rank3]:[W1204 13:23:06.761191298 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6598348Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6598603Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6598795Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6599165Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6599367Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6599472Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6599568Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6599667Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6599668Z 2025-12-04T13:44:25.6599907Z [rank3]:[W1204 13:23:06.763445638 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6600078Z [rank2]:[W1204 13:23:06.763059756 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6600253Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6600507Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6600671Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6601039Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6601240Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6601344Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6601439Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6601539Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6601541Z 2025-12-04T13:44:25.6601794Z [rank2]:[W1204 13:23:06.764207191 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6601966Z [rank1]:[W1204 13:23:07.395395440 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6602141Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6602394Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6602578Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6602946Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6603147Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6603252Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6603347Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6603443Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6603446Z 2025-12-04T13:44:25.6603680Z [rank1]:[W1204 13:23:07.397377236 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6603851Z [rank3]:[W1204 13:23:07.763592819 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6604024Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6604281Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6604444Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6604813Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6605014Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6605119Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6605214Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6605309Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6605312Z 2025-12-04T13:44:25.6605545Z [rank3]:[W1204 13:23:07.765893978 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6605735Z [rank2]:[W1204 13:23:07.764323893 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6605909Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6606165Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6606328Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6606724Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6606925Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6607030Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6607125Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6607222Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6607224Z 2025-12-04T13:44:25.6607457Z [rank2]:[W1204 13:23:07.766464435 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6607665Z [rank1]:[W1204 13:23:08.397567216 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6607841Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6608095Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6608258Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6608634Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6608839Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6608943Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6609038Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6609134Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6609136Z 2025-12-04T13:44:25.6609369Z [rank1]:[W1204 13:23:08.399138921 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6609568Z [rank3]:[W1204 13:23:08.766463750 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6609742Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6609997Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6610163Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6610527Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6610761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6610865Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6610962Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6611058Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6611060Z 2025-12-04T13:44:25.6611294Z [rank3]:[W1204 13:23:08.768605543 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6611467Z [rank2]:[W1204 13:23:08.766606447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6611641Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6611897Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6612060Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6612426Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6612628Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6612733Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6612830Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6612927Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6612929Z 2025-12-04T13:44:25.6613163Z [rank2]:[W1204 13:23:08.768799138 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6613334Z [rank1]:[W1204 13:23:09.399301873 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6613530Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6613784Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6613946Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6614310Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6614535Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6614641Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6614735Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6614832Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6614834Z 2025-12-04T13:44:25.6615067Z [rank1]:[W1204 13:23:09.401645821 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6615238Z [rank3]:[W1204 13:23:09.768768685 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6615413Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6615670Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6615833Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6616198Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6616401Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6616506Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6616601Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6616696Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6616698Z 2025-12-04T13:44:25.6616931Z [rank3]:[W1204 13:23:09.770010017 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6617101Z [rank2]:[W1204 13:23:09.768904632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6617277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6617588Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6617751Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6618117Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6618331Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6618450Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6618546Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6618643Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6618644Z 2025-12-04T13:44:25.6618878Z [rank2]:[W1204 13:23:09.770051906 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6619047Z [rank1]:[W1204 13:23:10.401812704 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6619223Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6619481Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6619644Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6620010Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6620211Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6620317Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6620413Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6620509Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6620511Z 2025-12-04T13:44:25.6620743Z [rank1]:[W1204 13:23:10.403059956 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6620913Z [rank2]:[W1204 13:23:10.770184680 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6621086Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6621362Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6621527Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6621896Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6622098Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6622229Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6622324Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6622421Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6622423Z 2025-12-04T13:44:25.6622656Z [rank2]:[W1204 13:23:10.771411783 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6622828Z [rank3]:[W1204 13:23:10.770187530 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6623001Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6623260Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6623424Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6623794Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6623995Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6624099Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6624197Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6624293Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6624295Z 2025-12-04T13:44:25.6624529Z [rank3]:[W1204 13:23:10.772171876 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6624698Z [rank1]:[W1204 13:23:11.403223050 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6624874Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6625128Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6625311Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6625679Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6625878Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6625983Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6626096Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6626194Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6626197Z 2025-12-04T13:44:25.6626428Z [rank1]:[W1204 13:23:11.404456652 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6626598Z [rank2]:[W1204 13:23:11.771568537 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6626771Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6627025Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6627190Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6627590Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6627794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6627898Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6627994Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6628092Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6628093Z 2025-12-04T13:44:25.6628327Z [rank2]:[W1204 13:23:11.772797239 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6628498Z [rank3]:[W1204 13:23:11.772295011 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6628671Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6628926Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6629090Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6629483Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6629685Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6629792Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6629891Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6630011Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6630012Z 2025-12-04T13:44:25.6630247Z [rank3]:[W1204 13:23:11.773476634 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6630417Z [rank1]:[W1204 13:23:12.404602397 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6630593Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6630847Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6631010Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6631379Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6631580Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6631687Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6631782Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6631882Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6631886Z 2025-12-04T13:44:25.6632120Z [rank1]:[W1204 13:23:12.405850619 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6632291Z [rank2]:[W1204 13:23:12.772958784 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6632465Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6632722Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6632886Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6633274Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6633476Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6633583Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6633680Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6633776Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6633798Z 2025-12-04T13:44:25.6634031Z [rank2]:[W1204 13:23:12.775202844 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6634204Z [rank3]:[W1204 13:23:12.773586130 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6634378Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6634633Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6634795Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6635169Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6635371Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6635477Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6635576Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6635676Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6635677Z 2025-12-04T13:44:25.6635912Z [rank3]:[W1204 13:23:12.775558496 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6636086Z [rank1]:[W1204 13:23:13.406016064 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6636260Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6636513Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6636676Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6637067Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6637269Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6637376Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6637514Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6637614Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6637616Z 2025-12-04T13:44:25.6637851Z [rank1]:[W1204 13:23:13.408379521 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6638053Z [rank3]:[W1204 13:23:13.775681712 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6638229Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6638486Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6638649Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6639016Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6639220Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6639326Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6639426Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6639524Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6639527Z 2025-12-04T13:44:25.6639759Z [rank3]:[W1204 13:23:13.777235108 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6639931Z [rank2]:[W1204 13:23:13.775360790 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6640106Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6640364Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6640528Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6640896Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6641122Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6641229Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6641325Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6641423Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6641425Z 2025-12-04T13:44:25.6641658Z [rank2]:[W1204 13:23:13.777262097 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6641839Z [rank1]:[W1204 13:23:14.408579626 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6642025Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6642279Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6642442Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6642809Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6643015Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6643123Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6643219Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6643319Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6643321Z 2025-12-04T13:44:25.6643553Z [rank1]:[W1204 13:23:14.411129610 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6643725Z [rank2]:[W1204 13:23:14.777378214 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6643902Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6644157Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6644323Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6644694Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6644899Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6645023Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6645120Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6645218Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6645221Z 2025-12-04T13:44:25.6645392Z [rank3]:[W1204 13:23:14.777378274 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6645567Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6645824Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6646011Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6646375Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6646577Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6646681Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6646780Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6646876Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6646879Z 2025-12-04T13:44:25.6647112Z [rank2]:[W1204 13:23:14.779681233 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6647342Z [rank3]:[W1204 13:23:14.779683203 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6647539Z [rank1]:[W1204 13:23:15.411314536 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6647715Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6647972Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6648136Z frame #1: + 
0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6648505Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6648705Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6648815Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6648939Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6649040Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6649041Z 2025-12-04T13:44:25.6649273Z [rank1]:[W1204 13:23:15.413034117 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6649444Z [rank2]:[W1204 13:23:15.779792171 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6649618Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6649900Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6650065Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6650434Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 
(0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6650637Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6650745Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6650844Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6650940Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6650944Z 2025-12-04T13:44:25.6651178Z [rank2]:[W1204 13:23:15.781548422 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6651349Z [rank3]:[W1204 13:23:15.779794251 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6651523Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6651780Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6651942Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6652307Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6652509Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6652616Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6652712Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6652835Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6652837Z 2025-12-04T13:44:25.6653070Z [rank3]:[W1204 13:23:15.781794266 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6653239Z [rank1]:[W1204 13:23:16.413253033 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6653413Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6653670Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6653856Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6654224Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6654426Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6654532Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6654630Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6654729Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6654730Z 2025-12-04T13:44:25.6654963Z [rank1]:[W1204 13:23:16.414934706 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6655136Z [rank2]:[W1204 13:23:16.781726559 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6655309Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6655563Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6655729Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6656101Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6656304Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6656408Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6656503Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6656603Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6656605Z 2025-12-04T13:44:25.6656861Z [rank2]:[W1204 13:23:16.783927260 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6657032Z [rank3]:[W1204 13:23:16.781940614 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6657206Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6657464Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6657691Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6658059Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6658260Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6658365Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6658461Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6658558Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6658560Z 2025-12-04T13:44:25.6658795Z [rank3]:[W1204 13:23:16.784143905 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6658966Z [rank1]:[W1204 13:23:17.415094764 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6659140Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6659393Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6659555Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6659923Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6660123Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6660228Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6660322Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6660419Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6660422Z 2025-12-04T13:44:25.6660679Z [rank1]:[W1204 13:23:17.417221407 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6660851Z [rank2]:[W1204 13:23:17.784083898 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6661027Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6661282Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6661445Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6661840Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6662041Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6662145Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6662241Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6662338Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6662341Z 2025-12-04T13:44:25.6662574Z [rank2]:[W1204 13:23:17.786153012 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6662745Z [rank3]:[W1204 13:23:17.784266354 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6662919Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6663177Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6663340Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6663712Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6663914Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6664019Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6664115Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6664210Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6664212Z 2025-12-04T13:44:25.6664448Z [rank3]:[W1204 13:23:17.786473835 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6664640Z [rank1]:[W1204 13:23:18.417406045 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6664815Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6665070Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6665232Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6665601Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6665828Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6665935Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6666029Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6666125Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6666127Z 2025-12-04T13:44:25.6666358Z [rank1]:[W1204 13:23:18.419522638 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6666533Z [rank2]:[W1204 13:23:18.786262252 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6666708Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6666962Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6667124Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6667530Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6667737Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6667841Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6667936Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6668033Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6668035Z 2025-12-04T13:44:25.6668267Z [rank2]:[W1204 13:23:18.788537532 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6668439Z [rank3]:[W1204 13:23:18.786595105 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6668641Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6668898Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6669061Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6669427Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6669654Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6669759Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6669856Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6669952Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6669954Z 2025-12-04T13:44:25.6670188Z [rank3]:[W1204 13:23:18.788658659 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6670360Z [rank1]:[W1204 13:23:19.419854113 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6670536Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6670794Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6670957Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6671326Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6671529Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6671635Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6671730Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6671826Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6671828Z 2025-12-04T13:44:25.6672063Z [rank1]:[W1204 13:23:19.422489525 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6672232Z [rank2]:[W1204 13:23:19.788691461 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6672408Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6672680Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6672843Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6673208Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6673430Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6673537Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6673632Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6673731Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6673732Z 2025-12-04T13:44:25.6673965Z [rank2]:[W1204 13:23:19.790040661 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6674139Z [rank3]:[W1204 13:23:19.788806389 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6674314Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6674572Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6674734Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6675100Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6675301Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6675406Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6675504Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6675601Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6675602Z 2025-12-04T13:44:25.6675836Z [rank3]:[W1204 13:23:19.790595709 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6676007Z [rank1]:[W1204 13:23:20.422633255 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6676183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6676463Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6676625Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6676992Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6677194Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6677319Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6677416Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6677545Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6677546Z 2025-12-04T13:44:25.6677780Z [rank1]:[W1204 13:23:20.423878307 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6677949Z [rank2]:[W1204 13:23:20.790172622 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6678122Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6678380Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6678544Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6678910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6679113Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6679219Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6679315Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6679414Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6679416Z 2025-12-04T13:44:25.6679650Z [rank2]:[W1204 13:23:20.792259916 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6679820Z [rank3]:[W1204 13:23:20.791476433 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6679993Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6680250Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6680442Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6680815Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6681017Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6681121Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6681248Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6681345Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6681347Z 2025-12-04T13:44:25.6681579Z [rank3]:[W1204 13:23:20.793724813 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6681748Z [rank1]:[W1204 13:23:21.424028269 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6681921Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6682174Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6682338Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6682702Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6682906Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6683011Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6683108Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6683205Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6683207Z 2025-12-04T13:44:25.6683443Z [rank1]:[W1204 13:23:21.425837338 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6683612Z [rank2]:[W1204 13:23:21.792433837 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6683787Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6684040Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6684205Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6684591Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6684794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6684899Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6684995Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6685115Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6685117Z 2025-12-04T13:44:25.6685351Z [rank2]:[W1204 13:23:21.794885643 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6685522Z [rank3]:[W1204 13:23:21.793866055 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6685695Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6685952Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6686115Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6686485Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6686686Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6686792Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6686887Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6686984Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6686987Z 2025-12-04T13:44:25.6687225Z [rank3]:[W1204 13:23:21.795895950 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6687397Z [rank1]:[W1204 13:23:22.426041509 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6687610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6687867Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6688028Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6688422Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6688622Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6688728Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6688823Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6688920Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6688953Z 2025-12-04T13:44:25.6689188Z [rank1]:[W1204 13:23:22.427830259 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6689360Z [rank2]:[W1204 13:23:22.795061594 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6689536Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6689793Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6689957Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6690325Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6690527Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6690633Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6690727Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6690824Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6690826Z 2025-12-04T13:44:25.6691060Z [rank2]:[W1204 13:23:22.797316654 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6691233Z [rank3]:[W1204 13:23:22.796033803 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6691407Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6691668Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6691831Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6692218Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6692421Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6692525Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6692622Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6692718Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6692720Z 2025-12-04T13:44:25.6692954Z [rank3]:[W1204 13:23:22.798225934 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6693145Z [rank1]:[W1204 13:23:23.427995202 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6693319Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6693575Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6693736Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6694105Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6694307Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6694413Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6694508Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6694605Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6694607Z 2025-12-04T13:44:25.6694841Z [rank1]:[W1204 13:23:23.430392608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6695013Z [rank2]:[W1204 13:23:23.797472027 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6695188Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6695441Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6695605Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6695975Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6696212Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6696318Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6696414Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6696512Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6696514Z 2025-12-04T13:44:25.6696747Z [rank2]:[W1204 13:23:23.799355505 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6696942Z [rank3]:[W1204 13:23:23.798370337 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6697117Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6697372Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6697572Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6697937Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6698144Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6698248Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6698343Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6698438Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6698440Z 2025-12-04T13:44:25.6698673Z [rank3]:[W1204 13:23:23.800591337 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6698842Z [rank1]:[W1204 13:23:24.430561571 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6699020Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6699274Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6699435Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6699803Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6700006Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6700140Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6700236Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6700335Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6700337Z 2025-12-04T13:44:25.6700572Z [rank1]:[W1204 13:23:24.432174645 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6700741Z [rank2]:[W1204 13:23:24.799464549 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6700942Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6701198Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6701362Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6701731Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6701935Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6702042Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6702138Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6702236Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6702238Z 2025-12-04T13:44:25.6702476Z [rank2]:[W1204 13:23:24.801563502 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6702648Z [rank3]:[W1204 13:23:24.800745771 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6702823Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6703080Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6703244Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6703609Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6703811Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6703916Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6704032Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6704129Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6704131Z 2025-12-04T13:44:25.6704365Z [rank3]:[W1204 13:23:24.803275874 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6704538Z [rank1]:[W1204 13:23:25.432332559 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6704713Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6704991Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6705154Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6705523Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6705724Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6705832Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6705928Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6706025Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6706027Z 2025-12-04T13:44:25.6706260Z [rank1]:[W1204 13:23:25.433585131 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6706429Z [rank2]:[W1204 13:23:25.801680527 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6706603Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6706863Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6707026Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6707392Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6707642Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6707749Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6707844Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6707969Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6707971Z 2025-12-04T13:44:25.6708203Z [rank2]:[W1204 13:23:25.803596555 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6708374Z [rank3]:[W1204 13:23:25.803397239 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6708548Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6708804Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6708993Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6709361Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6709562Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6709666Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6709764Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6709861Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6709863Z 2025-12-04T13:44:25.6710095Z [rank3]:[W1204 13:23:25.806193467 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6710267Z [rank1]:[W1204 13:23:26.433714686 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6710440Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6710695Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6710860Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6711236Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6711437Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6711542Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6711641Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6711736Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6711738Z 2025-12-04T13:44:25.6711998Z [rank1]:[W1204 13:23:26.434912869 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6712167Z [rank2]:[W1204 13:23:26.803686071 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6712341Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6712594Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6712776Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6713143Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6713345Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6713450Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6713546Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6713646Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6713648Z 2025-12-04T13:44:25.6713882Z [rank2]:[W1204 13:23:26.805815144 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6714053Z [rank3]:[W1204 13:23:26.806270623 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6714228Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6714484Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6714651Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6715019Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6715220Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6715324Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6715420Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6715516Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6715519Z 2025-12-04T13:44:25.6715805Z [rank3]:[W1204 13:23:26.807860458 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6715977Z [rank1]:[W1204 13:23:27.435066275 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6716151Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6716408Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6716571Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6716963Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6717163Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6717269Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6717364Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6717460Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6717463Z 2025-12-04T13:44:25.6717746Z [rank1]:[W1204 13:23:27.436299387 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6717921Z [rank2]:[W1204 13:23:27.805955219 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6718097Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6718354Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6718517Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6718887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6719088Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6719193Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6719288Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6719386Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6719387Z 2025-12-04T13:44:25.6719622Z [rank2]:[W1204 13:23:27.808012134 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6719819Z [rank3]:[W1204 13:23:27.808024723 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6719994Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6720252Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6720414Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6720796Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6721012Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6721116Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6721211Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6721307Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6721308Z 2025-12-04T13:44:25.6721541Z [rank3]:[W1204 13:23:27.810145406 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6721714Z [rank1]:[W1204 13:23:28.436426084 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6721888Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6722146Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6722309Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6722676Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6722882Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6722987Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6723083Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6723178Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6723179Z 2025-12-04T13:44:25.6723412Z [rank1]:[W1204 13:23:28.437762104 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6723584Z [rank2]:[W1204 13:23:28.808120441 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6723780Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6724035Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6725573Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6725950Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6726189Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6726294Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6726390Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6726487Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6726489Z 2025-12-04T13:44:25.6726738Z [rank2]:[W1204 13:23:28.810035668 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6726910Z [rank3]:[W1204 13:23:28.810269633 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6727087Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6727342Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6727542Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6727910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6728117Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6728224Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6728319Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6728417Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6728419Z 2025-12-04T13:44:25.6728654Z [rank3]:[W1204 13:23:28.812206770 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6728824Z [rank1]:[W1204 13:23:29.437904621 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6729017Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6729271Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6729435Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6729858Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6730092Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6730199Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6730294Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6730392Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6730394Z 2025-12-04T13:44:25.6730627Z [rank1]:[W1204 13:23:29.439281840 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6730799Z [rank2]:[W1204 13:23:29.810162266 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6730976Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6731232Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6731395Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6731763Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6731966Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6732070Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6732167Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6732264Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6732265Z 2025-12-04T13:44:25.6732501Z [rank2]:[W1204 13:23:29.811871718 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6732675Z [rank3]:[W1204 13:23:29.812334258 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6732851Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6733116Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6733279Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6733661Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6733862Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6733988Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6734083Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6734181Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6734183Z 2025-12-04T13:44:25.6734417Z [rank3]:[W1204 13:23:29.814361603 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6734591Z [rank1]:[W1204 13:23:30.439453237 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6734769Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6735030Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6735194Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6735560Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6735761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6735869Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6735966Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6736063Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6736065Z 2025-12-04T13:44:25.6736298Z [rank1]:[W1204 13:23:30.440702310 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6736471Z [rank2]:[W1204 13:23:30.811967017 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6736644Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6736905Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6737080Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6737448Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6737697Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6737831Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6737927Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6738024Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6738026Z 2025-12-04T13:44:25.6738258Z [rank2]:[W1204 13:23:30.813852855 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6738427Z [rank3]:[W1204 13:23:30.814450962 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6738602Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6738861Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6739028Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6739398Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6739600Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6739705Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6739802Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6739899Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6739901Z 2025-12-04T13:44:25.6740134Z [rank3]:[W1204 13:23:30.815920119 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6740305Z [rank1]:[W1204 13:23:31.440907787 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6740480Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6740734Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6740910Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6741279Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6741498Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6741604Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6741715Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6741822Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6741824Z 2025-12-04T13:44:25.6742058Z [rank1]:[W1204 13:23:31.442730796 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6742229Z [rank2]:[W1204 13:23:31.814113291 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6742403Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6742663Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6742828Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6743196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6743402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6743508Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6743604Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6743702Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6743704Z 2025-12-04T13:44:25.6743940Z [rank2]:[W1204 13:23:31.815866442 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6744110Z [rank3]:[W1204 13:23:31.816062248 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6744283Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6744540Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6744704Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6745080Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6745282Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6745398Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6745493Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6745590Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6745614Z 2025-12-04T13:44:25.6745850Z [rank3]:[W1204 13:23:31.817712841 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6746019Z [rank1]:[W1204 13:23:32.442928594 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6746194Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6746450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6746613Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6746981Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6747183Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6747289Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6747385Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6747512Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6747516Z 2025-12-04T13:44:25.6747751Z [rank1]:[W1204 13:23:32.445480487 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6747926Z [rank2]:[W1204 13:23:32.815959152 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6748100Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6748356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6748520Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6748905Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6749106Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6749213Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6749309Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6749419Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6749421Z 2025-12-04T13:44:25.6749670Z [rank2]:[W1204 13:23:32.817175645 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6749854Z [rank3]:[W1204 13:23:32.817808011 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6750028Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6750286Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6750448Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6750816Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6751017Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6751123Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6751218Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6751315Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6751317Z 2025-12-04T13:44:25.6751552Z [rank3]:[W1204 13:23:32.819041484 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6751724Z [rank1]:[W1204 13:23:33.445683525 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6751900Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6752161Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6752328Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6752694Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6752906Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6753012Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6753106Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6753202Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6753204Z 2025-12-04T13:44:25.6753447Z [rank1]:[W1204 13:23:33.448210699 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6753639Z [rank2]:[W1204 13:23:33.817322845 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6753813Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6754069Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6754232Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6754601Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6754807Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6754911Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6755007Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6755103Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6755105Z 2025-12-04T13:44:25.6755341Z [rank2]:[W1204 13:23:33.819472577 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6755513Z [rank3]:[W1204 13:23:33.819163644 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6755687Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6755943Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6756106Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6756475Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6756695Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6756800Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6756896Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6756993Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6756995Z 2025-12-04T13:44:25.6757239Z [rank3]:[W1204 13:23:33.821375315 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6757411Z [rank1]:[W1204 13:23:34.448398549 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6757637Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6757891Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6758055Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6758422Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6758626Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6758732Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6758828Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6758924Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6758926Z 2025-12-04T13:44:25.6759161Z [rank1]:[W1204 13:23:34.450434393 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6759331Z [rank2]:[W1204 13:23:34.819660087 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6759508Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6759763Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6759927Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6760295Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6760498Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6760617Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6760714Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6760811Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6760813Z 2025-12-04T13:44:25.6761050Z [rank2]:[W1204 13:23:34.821593674 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6761234Z [rank3]:[W1204 13:23:34.821555215 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6761439Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6761696Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6761858Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6762227Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6762428Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6762535Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6762631Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6762727Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6762729Z 2025-12-04T13:44:25.6762962Z [rank3]:[W1204 13:23:34.823269977 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6763131Z [rank3]:[W1204 13:23:35.239011937 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6763134Z 2025-12-04T13:44:25.6763297Z [rank0]:[W1204 13:23:35.259309386 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6763301Z 2025-12-04T13:44:25.6763461Z [rank2]:[W1204 13:23:35.288835929 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6763464Z 2025-12-04T13:44:25.6763622Z [rank1]:[W1204 13:23:35.317996091 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.6763624Z 2025-12-04T13:44:25.6763794Z [rank1]:[W1204 13:23:35.450611834 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6763967Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6764224Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6764400Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6764767Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6764980Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6765085Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6765192Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6765299Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6765301Z 2025-12-04T13:44:25.6765539Z [rank1]:[W1204 13:23:35.451860166 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6765711Z [rank2]:[W1204 13:23:35.821743855 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6765884Z Exception 
raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6766141Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6766306Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6766674Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6766876Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6766985Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6767083Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6767182Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6767184Z 2025-12-04T13:44:25.6767420Z [rank2]:[W1204 13:23:35.823022757 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6767631Z [rank3]:[W1204 13:23:35.823401948 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6767806Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6768063Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6768228Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6768607Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6768808Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6768926Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6769022Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6769120Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6769151Z 2025-12-04T13:44:25.6769384Z [rank3]:[W1204 13:23:35.824565572 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6769554Z [rank1]:[W1204 13:23:36.452019457 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6769729Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6769989Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6770155Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.6770521Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6770723Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6770826Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6770924Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6771019Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6771023Z 2025-12-04T13:44:25.6771256Z [rank1]:[W1204 13:23:36.453248410 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6771429Z [rank2]:[W1204 13:23:36.823174139 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6771603Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6771861Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6772025Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6772403Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.6772605Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6772709Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6772805Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6772912Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6772914Z 2025-12-04T13:44:25.6773157Z [rank2]:[W1204 13:23:36.824920380 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6773338Z [rank3]:[W1204 13:23:36.824693855 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6773514Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6773772Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6773936Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6774311Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6774513Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6774619Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6774715Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6774814Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6774816Z 2025-12-04T13:44:25.6775049Z [rank3]:[W1204 13:23:36.826586893 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6775222Z [rank1]:[W1204 13:23:37.453547809 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6775396Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6775649Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6775813Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6776180Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6776394Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6776498Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6776594Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6776690Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6776693Z 2025-12-04T13:44:25.6776936Z [rank1]:[W1204 13:23:37.455419198 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6777129Z [rank2]:[W1204 13:23:37.825053113 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6777305Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6777601Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6777766Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6778139Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6778342Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6778446Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6778543Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6778639Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6778641Z 2025-12-04T13:44:25.6778875Z [rank2]:[W1204 13:23:37.826280356 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6779047Z [rank3]:[W1204 13:23:37.826744575 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6779221Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6779476Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6779641Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6780008Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6780227Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6780333Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6780428Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6780525Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6780527Z 2025-12-04T13:44:25.6780778Z [rank3]:[W1204 13:23:37.829068604 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6780949Z [rank1]:[W1204 13:23:38.455612640 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.6781152Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6781407Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6781570Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6781935Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6782138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6782242Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6782338Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6782434Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6782438Z 2025-12-04T13:44:25.6782670Z [rank1]:[W1204 13:23:38.457883319 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6782845Z [rank2]:[W1204 13:23:38.826431809 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6783021Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6783277Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6783442Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6783813Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6784017Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6784131Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6784228Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6784325Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6784327Z 2025-12-04T13:44:25.6784560Z [rank2]:[W1204 13:23:38.828031833 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6784740Z [rank3]:[W1204 13:23:38.829220447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6784940Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6785199Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6785363Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6785731Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6785935Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6786043Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6786138Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6786236Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6786237Z 2025-12-04T13:44:25.6786471Z [rank3]:[W1204 13:23:38.831227632 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6786645Z [rank1]:[W1204 13:23:39.458079632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6786820Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6787076Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6787240Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6787642Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6787844Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6787950Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6788058Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6788154Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6788158Z 2025-12-04T13:44:25.6788391Z [rank1]:[W1204 13:23:39.460378271 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6788576Z [rank2]:[W1204 13:23:39.828139008 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6788751Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6789030Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6789194Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6789566Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6789771Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6789876Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6789973Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6790070Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6790072Z 2025-12-04T13:44:25.6790308Z [rank2]:[W1204 13:23:39.830225122 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6790480Z [rank3]:[W1204 13:23:39.831339607 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6790656Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6790912Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6791079Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6791447Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6791652Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6791760Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6791857Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6791953Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6791964Z 2025-12-04T13:44:25.6792198Z [rank3]:[W1204 13:23:39.833485429 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6792372Z [rank1]:[W1204 13:23:40.460580634 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6792561Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6792818Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6793002Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6793368Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6793570Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6793674Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6793770Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6793870Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6793872Z 2025-12-04T13:44:25.6794106Z [rank1]:[W1204 13:23:40.462976001 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6794277Z [rank2]:[W1204 13:23:40.830333717 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6794451Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6794708Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6794873Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6795251Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6795880Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6796239Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6796486Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6796731Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6796875Z 2025-12-04T13:44:25.6797118Z [rank2]:[W1204 13:23:40.832430241 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6797611Z [rank3]:[W1204 13:23:40.833579445 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6797992Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6798474Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6798966Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6799529Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6800129Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6800470Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6800711Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6800942Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6801082Z 2025-12-04T13:44:25.6801320Z [rank3]:[W1204 13:23:40.835766837 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6801769Z [rank1]:[W1204 13:23:41.463179105 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6802162Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6802627Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6803082Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6803663Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6804286Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6804632Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6804871Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6805101Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6805240Z 2025-12-04T13:44:25.6805497Z [rank1]:[W1204 13:23:41.465584111 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6805954Z [rank2]:[W1204 13:23:41.832527837 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6806463Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6806995Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6807451Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6808083Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6808691Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6809032Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6809272Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6809507Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6809644Z 2025-12-04T13:44:25.6809879Z [rank2]:[W1204 13:23:41.834193740 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6810320Z [rank3]:[W1204 13:23:41.835871933 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6810698Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6811163Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6811617Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6812185Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6812785Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6813127Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6813365Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6813597Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6813734Z 2025-12-04T13:44:25.6813968Z [rank3]:[W1204 13:23:41.837605304 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6814465Z [rank1]:[W1204 13:23:42.465780646 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6814845Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6815308Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6815773Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6816340Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6816969Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6817310Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6817594Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6817828Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6817965Z 2025-12-04T13:44:25.6818201Z [rank1]:[W1204 13:23:42.468134954 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6818815Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/constant_folding.py:256: UserWarning: Unsupported unwinding pattern: Address not in range (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/profiler/unwind/unwind.cpp:219.) 2025-12-04T13:44:25.6819240Z if out == self.unknown_value: 2025-12-04T13:44:25.6824308Z [rank3]:[W1204 13:23:42.670745812 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6824511Z 2025-12-04T13:44:25.6824672Z [rank3]:[W1204 13:23:42.673157138 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6824868Z 2025-12-04T13:44:25.6825027Z [rank3]:[W1204 13:23:42.673232536 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6825225Z 2025-12-04T13:44:25.6825384Z [rank3]:[W1204 13:23:42.673449952 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.6825578Z 2025-12-04T13:44:25.6825737Z [rank3]:[W1204 13:23:42.673503940 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6825931Z 2025-12-04T13:44:25.6826090Z [rank3]:[W1204 13:23:42.673696246 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6826284Z 2025-12-04T13:44:25.6826442Z [rank3]:[W1204 13:23:42.673764175 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6826637Z 2025-12-04T13:44:25.6826980Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/constant_folding.py:256: UserWarning: Unsupported unwinding pattern: Address not in range (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/profiler/unwind/unwind.cpp:219.) 2025-12-04T13:44:25.6827403Z if out == self.unknown_value: 2025-12-04T13:44:25.6827724Z [rank2]:[W1204 13:23:42.744264648 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6827922Z 2025-12-04T13:44:25.6828080Z [rank2]:[W1204 13:23:42.746459659 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6828276Z 2025-12-04T13:44:25.6828433Z [rank2]:[W1204 13:23:42.746560467 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6828629Z 2025-12-04T13:44:25.6828801Z [rank2]:[W1204 13:23:42.746791152 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6828996Z 2025-12-04T13:44:25.6829154Z [rank2]:[W1204 13:23:42.746848540 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.6829376Z 2025-12-04T13:44:25.6829535Z [rank2]:[W1204 13:23:42.747038476 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6829729Z 2025-12-04T13:44:25.6829887Z [rank2]:[W1204 13:23:42.747089915 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6830082Z 2025-12-04T13:44:25.6830424Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/constant_folding.py:256: UserWarning: Unsupported unwinding pattern: Address not in range (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/profiler/unwind/unwind.cpp:219.) 2025-12-04T13:44:25.6830835Z if out == self.unknown_value: 2025-12-04T13:44:25.6831066Z [rank1]:[W1204 13:23:42.768275754 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6831265Z 2025-12-04T13:44:25.6831424Z [rank1]:[W1204 13:23:42.770444906 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6831619Z 2025-12-04T13:44:25.6831777Z [rank1]:[W1204 13:23:42.770526334 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6831974Z 2025-12-04T13:44:25.6832131Z [rank1]:[W1204 13:23:42.770728940 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6832327Z 2025-12-04T13:44:25.6832485Z [rank1]:[W1204 13:23:42.770780939 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6832682Z 2025-12-04T13:44:25.6832839Z [rank1]:[W1204 13:23:42.770960065 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.6833036Z 2025-12-04T13:44:25.6833195Z [rank1]:[W1204 13:23:42.771013054 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6833390Z 2025-12-04T13:44:25.6833562Z [rank2]:[W1204 13:23:42.834329937 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6833943Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6834414Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6834874Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6835456Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6836065Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6836411Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6836665Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6836900Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6837063Z 2025-12-04T13:44:25.6837299Z [rank2]:[W1204 13:23:42.836263464 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 
2025-12-04T13:44:25.6837792Z [rank3]:[W1204 13:23:42.837744631 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6838169Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6838632Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6839091Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6839657Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6840255Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6840593Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6840829Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6841061Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6841197Z 2025-12-04T13:44:25.6841436Z [rank3]:[W1204 13:23:42.840192286 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6842046Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/constant_folding.py:256: UserWarning: Unsupported unwinding pattern: Address not in range (Triggered internally at 
/var/lib/jenkins/workspace/torch/csrc/profiler/unwind/unwind.cpp:219.) 2025-12-04T13:44:25.6842460Z if out == self.unknown_value: 2025-12-04T13:44:25.6842695Z [rank0]:[W1204 13:23:43.938012763 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6842893Z 2025-12-04T13:44:25.6843052Z [rank0]:[W1204 13:23:43.940410819 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6843249Z 2025-12-04T13:44:25.6843406Z [rank0]:[W1204 13:23:43.940489698 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6843604Z 2025-12-04T13:44:25.6843791Z [rank0]:[W1204 13:23:43.940708773 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6843987Z 2025-12-04T13:44:25.6844143Z [rank0]:[W1204 13:23:43.940765402 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6844336Z 2025-12-04T13:44:25.6844492Z [rank0]:[W1204 13:23:43.940956877 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T13:44:25.6844685Z 2025-12-04T13:44:25.6844856Z [rank0]:[W1204 13:23:43.941010456 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T13:44:25.6845050Z 2025-12-04T13:44:25.6845221Z [rank1]:[W1204 13:23:43.468314910 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6845630Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6846103Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6846559Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6847129Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6847784Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6848125Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6848361Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6848592Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6848731Z 2025-12-04T13:44:25.6848968Z [rank1]:[W1204 13:23:43.470347144 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6849404Z [rank2]:[W1204 13:23:43.836388971 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6849783Z Exception 
raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6850244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6850699Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6851264Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6851864Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6852223Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6852461Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6852691Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6852827Z 2025-12-04T13:44:25.6853061Z [rank2]:[W1204 13:23:43.838514714 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6853510Z [rank3]:[W1204 13:23:43.840300574 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6853919Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6854382Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6854831Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6855398Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6855999Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6856341Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6856578Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6856807Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6856942Z 2025-12-04T13:44:25.6857176Z [rank3]:[W1204 13:23:43.842551634 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6857645Z [rank1]:[W1204 13:23:44.470535441 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6858017Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6858485Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6858936Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.6859503Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6860113Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6860457Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6860712Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6860952Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6861095Z 2025-12-04T13:44:25.6861329Z [rank1]:[W1204 13:23:44.473015546 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6861977Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T13:44:25.6862405Z warnings.warn( 2025-12-04T13:44:25.6862673Z [rank2]:[W1204 13:23:44.838632742 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6863062Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6863525Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6863976Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6864540Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6865141Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6865480Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6865716Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6865945Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6866082Z 2025-12-04T13:44:25.6866317Z [rank2]:[W1204 13:23:44.841023959 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6866752Z [rank3]:[W1204 13:23:44.842664742 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.6867132Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6867620Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6868085Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6868658Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6869265Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6869629Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6869869Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6870098Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6870235Z 2025-12-04T13:44:25.6870491Z [rank3]:[W1204 13:23:44.844817395 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6871118Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T13:44:25.6871588Z warnings.warn( 2025-12-04T13:44:25.6871829Z [rank1]:[W1204 13:23:45.473177713 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6872210Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6872677Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6873136Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6873713Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6874331Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6874682Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6874929Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6875162Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6875296Z 2025-12-04T13:44:25.6875531Z [rank1]:[W1204 13:23:45.474460645 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6875967Z [rank2]:[W1204 13:23:45.841202586 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.6876350Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6876821Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6877280Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6877890Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6878509Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6878849Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6879093Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6879326Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6879476Z 2025-12-04T13:44:25.6879710Z [rank2]:[W1204 13:23:45.843307070 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6880177Z [rank3]:[W1204 13:23:45.844971233 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6880552Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6881012Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6881460Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6882022Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6882623Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6882961Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6883197Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6883425Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6883561Z 2025-12-04T13:44:25.6883796Z [rank3]:[W1204 13:23:45.847136324 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6884235Z [rank1]:[W1204 13:23:46.474624733 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6884610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6885071Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6885520Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6886085Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6886697Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6887035Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6887270Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6887540Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6887675Z 2025-12-04T13:44:25.6887922Z [rank1]:[W1204 13:23:46.476871393 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6888356Z [rank2]:[W1204 13:23:46.843408729 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6888764Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6889229Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6889680Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6890242Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6890842Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6891183Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6891418Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6891652Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6891788Z 2025-12-04T13:44:25.6892024Z [rank2]:[W1204 13:23:46.845720948 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6892460Z [rank3]:[W1204 13:23:46.847276034 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6892837Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6893302Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6893754Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6894326Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6894927Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6895280Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6895516Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6895744Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6895880Z 2025-12-04T13:44:25.6896113Z [rank3]:[W1204 13:23:46.849381767 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6896557Z [rank1]:[W1204 13:23:47.477039462 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6896956Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6897417Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6897904Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6898469Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6899070Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6899412Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6899647Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6899876Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6900012Z 2025-12-04T13:44:25.6900247Z [rank1]:[W1204 13:23:47.479046807 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6900681Z [rank2]:[W1204 13:23:47.845943966 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6901058Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6901522Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6901976Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6902538Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6903139Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6903481Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6903732Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6903964Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6904100Z 2025-12-04T13:44:25.6904333Z [rank2]:[W1204 13:23:47.847499971 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6904779Z [rank3]:[W1204 13:23:47.849490697 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6905155Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6905656Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6906108Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6906672Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6907274Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6907636Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6907871Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6908102Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6908238Z 2025-12-04T13:44:25.6908474Z [rank3]:[W1204 13:23:47.851838025 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6908910Z [rank1]:[W1204 13:23:48.479199867 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6909288Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6909755Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6910209Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6910771Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6911371Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6911715Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6911955Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6912184Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6912340Z 2025-12-04T13:44:25.6912575Z [rank1]:[W1204 13:23:48.481495046 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6913010Z [rank2]:[W1204 13:23:48.847664131 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6913399Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6913862Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6914339Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6914904Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6915504Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6915843Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6916078Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6916308Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6916443Z 2025-12-04T13:44:25.6916679Z [rank2]:[W1204 13:23:48.849944020 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6917118Z [rank3]:[W1204 13:23:48.851969675 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6917507Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6917973Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6918429Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6918994Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6919594Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6919934Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6920170Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6920401Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6920538Z 2025-12-04T13:44:25.6920787Z [rank3]:[W1204 13:23:48.854211046 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6921221Z [rank1]:[W1204 13:23:49.481679616 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6921597Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6922072Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6922553Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6923115Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6923712Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6924052Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6924290Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6924520Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6924660Z 2025-12-04T13:44:25.6924894Z [rank1]:[W1204 13:23:49.483607623 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6925328Z [rank2]:[W1204 13:23:49.850101161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6925701Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6926162Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6926611Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6927178Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6927810Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6928148Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6928386Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6928616Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6928753Z 2025-12-04T13:44:25.6928988Z [rank2]:[W1204 13:23:49.852442009 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6929436Z [rank3]:[W1204 13:23:49.854307038 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6929816Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6930294Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6930745Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6931338Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6931939Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6932283Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6932519Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6932748Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6932884Z 2025-12-04T13:44:25.6933116Z [rank3]:[W1204 13:23:49.856623686 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6933554Z [rank1]:[W1204 13:23:50.483809483 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6933928Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6934388Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6934839Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6935400Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6936001Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6936341Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6936577Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6936808Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6936944Z 2025-12-04T13:44:25.6937177Z [rank1]:[W1204 13:23:50.486150881 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6937672Z [rank2]:[W1204 13:23:50.852598901 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6938050Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6938509Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6938980Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6939542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6940175Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6940514Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6940748Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6940978Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6941113Z 2025-12-04T13:44:25.6941348Z [rank2]:[W1204 13:23:50.854925889 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6941784Z [rank3]:[W1204 13:23:50.856744828 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6942159Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6942621Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6943068Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6943629Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6944228Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6944568Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6944805Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6945038Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6945176Z 2025-12-04T13:44:25.6945410Z [rank3]:[W1204 13:23:50.859316121 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6945847Z [rank1]:[W1204 13:23:51.486315243 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6946234Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6946698Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6947147Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6947767Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6948394Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6948733Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6948969Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6949197Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6949334Z 2025-12-04T13:44:25.6949568Z [rank1]:[W1204 13:23:51.488566733 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6950005Z [rank2]:[W1204 13:23:51.855079461 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6950385Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6950850Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6951301Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6951865Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6952469Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6952810Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6953046Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6953276Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6953412Z 2025-12-04T13:44:25.6953647Z [rank2]:[W1204 13:23:51.857534026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6954081Z [rank3]:[W1204 13:23:51.859428574 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6954459Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6954934Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6955388Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6955970Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6956585Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6956926Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6957161Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6957390Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6957646Z 2025-12-04T13:44:25.6957879Z [rank3]:[W1204 13:23:51.861520778 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6958314Z [rank1]:[W1204 13:23:52.488718235 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6958690Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6959154Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6959601Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6960165Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6960764Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6961103Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6961339Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6961568Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6961704Z 2025-12-04T13:44:25.6961936Z [rank1]:[W1204 13:23:52.491074123 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6962371Z [rank2]:[W1204 13:23:52.857662440 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6962747Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6963224Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6963677Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6964253Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6964855Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6965225Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6965464Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6965694Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6965830Z 2025-12-04T13:44:25.6966065Z [rank2]:[W1204 13:23:52.860188704 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6966504Z [rank3]:[W1204 13:23:52.861621242 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6966879Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6967343Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6967832Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6968395Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6968992Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6969331Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6969568Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6969797Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6969933Z 2025-12-04T13:44:25.6970169Z [rank3]:[W1204 13:23:52.863771054 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6970604Z [rank1]:[W1204 13:23:53.491226956 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6970979Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6971440Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6971916Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6972476Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6973087Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6973426Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6973690Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6973918Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6974055Z 2025-12-04T13:44:25.6974287Z [rank1]:[W1204 13:23:53.492492138 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6974723Z [rank2]:[W1204 13:23:53.860349927 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6975099Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6975561Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6976010Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6976571Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6977169Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6977564Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6977801Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6978034Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6978171Z 2025-12-04T13:44:25.6978405Z [rank2]:[W1204 13:23:53.862806252 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6978838Z [rank3]:[W1204 13:23:53.863884228 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6979213Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.6979680Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6980132Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6980708Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6981308Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6981661Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6981899Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6982160Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6982295Z 2025-12-04T13:44:25.6982529Z [rank3]:[W1204 13:23:53.866122309 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6982964Z [rank1]:[W1204 13:23:54.492659452 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6983338Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6983798Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6984251Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6984817Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6985419Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6985758Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6985993Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6986222Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6986360Z 2025-12-04T13:44:25.6986594Z [rank1]:[W1204 13:23:54.494812964 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6987027Z [rank2]:[W1204 13:23:54.862963206 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6987403Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6987899Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6988348Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6988934Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6989534Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6989874Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6990125Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6990355Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6990521Z 2025-12-04T13:44:25.6990754Z [rank2]:[W1204 13:23:54.865081059 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6991188Z [rank3]:[W1204 13:23:54.866234634 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6991563Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6992023Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6992471Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6993034Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6993630Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6993970Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6994204Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6994433Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6994568Z 2025-12-04T13:44:25.6994803Z [rank3]:[W1204 13:23:54.868738838 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6995241Z [rank1]:[W1204 13:23:55.494995268 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6995615Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.6996074Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.6996523Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6997095Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.6997754Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.6998095Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.6998331Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6998577Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.6998713Z 2025-12-04T13:44:25.6998948Z [rank1]:[W1204 13:23:55.496548303 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.6999419Z [rank2]:[W1204 13:23:55.865251974 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.6999793Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7000252Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7000704Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7001265Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7001867Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7002207Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7002442Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7002672Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7002809Z 2025-12-04T13:44:25.7003048Z [rank2]:[W1204 13:23:55.867367287 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7003485Z [rank3]:[W1204 13:23:55.868827244 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7003860Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7004320Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7004769Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7005333Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7005948Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7006053Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7006150Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7006245Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7006247Z 2025-12-04T13:44:25.7006495Z [rank3]:[W1204 13:23:55.871204292 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7006690Z [rank1]:[W1204 13:23:56.496701389 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7006866Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7007122Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7007283Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7007691Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7007895Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7008000Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7008094Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7008191Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7008193Z 2025-12-04T13:44:25.7008426Z [rank1]:[W1204 13:23:56.497946621 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7008596Z [rank2]:[W1204 13:23:56.867520302 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7008772Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7009031Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7009195Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7009561Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7009765Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7009886Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7009981Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7010078Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7010080Z 2025-12-04T13:44:25.7010325Z [rank2]:[W1204 13:23:56.869696284 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7010496Z [rank3]:[W1204 13:23:56.871300288 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7010698Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7010954Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7011117Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7011483Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7011687Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7011793Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7011889Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7011984Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7011986Z 2025-12-04T13:44:25.7012219Z [rank3]:[W1204 13:23:56.873557238 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7012389Z [rank1]:[W1204 13:23:57.498064068 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7012562Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7012820Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7012981Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7013351Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7013552Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7013659Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7013765Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7013863Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7013864Z 2025-12-04T13:44:25.7014096Z [rank1]:[W1204 13:23:57.499228422 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7014275Z [rank2]:[W1204 13:23:57.869821881 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7014450Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7014725Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7014889Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7015255Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7015458Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7015564Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7015658Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7015755Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7015757Z 2025-12-04T13:44:25.7015990Z [rank2]:[W1204 13:23:57.870997315 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7016161Z [rank3]:[W1204 13:23:57.873691175 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7016334Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7016592Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7016755Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7017120Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7017321Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7017424Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7017547Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7017660Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7017662Z 2025-12-04T13:44:25.7017897Z [rank3]:[W1204 13:23:57.875502565 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7018066Z [rank1]:[W1204 13:23:58.499537415 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7018252Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7018507Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7018707Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7019073Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7019274Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7019380Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7019477Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7019572Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7019575Z 2025-12-04T13:44:25.7019807Z [rank1]:[W1204 13:23:58.500926504 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7019978Z [rank2]:[W1204 13:23:58.871180951 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7020153Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7020406Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7020571Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7020938Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7021138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7021243Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7021338Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7021436Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7021438Z 2025-12-04T13:44:25.7021681Z [rank2]:[W1204 13:23:58.873452660 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7021851Z [rank3]:[W1204 13:23:58.875644282 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7022026Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7022295Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7022480Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7022846Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7023049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7023153Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7023249Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7023346Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7023347Z 2025-12-04T13:44:25.7023579Z [rank3]:[W1204 13:23:58.877889702 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7023750Z [rank1]:[W1204 13:23:59.501124290 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7023922Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7024181Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7024344Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7024710Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7024909Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7025014Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7025110Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7025205Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7025208Z 2025-12-04T13:44:25.7025450Z [rank1]:[W1204 13:23:59.502857842 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7025619Z [rank2]:[W1204 13:23:59.873647127 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7025793Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7026057Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7026220Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7026610Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7026813Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7026917Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7027013Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7027110Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7027114Z 2025-12-04T13:44:25.7027347Z [rank2]:[W1204 13:23:59.875918486 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7027554Z [rank3]:[W1204 13:23:59.878123677 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7027727Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7027982Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7028146Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7028514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7028718Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7028822Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7028916Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7029014Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7029016Z 2025-12-04T13:44:25.7029248Z [rank3]:[W1204 13:23:59.880442936 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7029437Z [rank1]:[W1204 13:24:00.503023050 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7029611Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7029866Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7030041Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7030408Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7030635Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7030740Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7030836Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7030932Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7030934Z 2025-12-04T13:44:25.7031168Z [rank1]:[W1204 13:24:00.504269172 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7031341Z [rank2]:[W1204 13:24:00.876056375 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7031514Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7031767Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7031931Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7032296Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7032498Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7032603Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7032698Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7032795Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7032797Z 2025-12-04T13:44:25.7033032Z [rank2]:[W1204 13:24:00.878475941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7033203Z [rank3]:[W1204 13:24:00.880533155 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7033386Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7033641Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7033802Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7034185Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7034407Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7034511Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7034606Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7034701Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7034703Z 2025-12-04T13:44:25.7034937Z [rank3]:[W1204 13:24:00.882749526 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7035110Z [rank1]:[W1204 13:24:01.504432880 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7035287Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7035544Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7035706Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7036071Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7036273Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7036377Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7036472Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7036568Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7036570Z 2025-12-04T13:44:25.7036803Z [rank1]:[W1204 13:24:01.505685492 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7036972Z [rank2]:[W1204 13:24:01.878645620 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7037146Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7037410Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7037606Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7037985Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7038210Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7038316Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7038411Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7038508Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7038510Z 2025-12-04T13:44:25.7038743Z [rank2]:[W1204 13:24:01.880923279 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7038913Z [rank3]:[W1204 13:24:01.882861226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7039089Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7039343Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7039506Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7039871Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7040073Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7040177Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7040273Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7040369Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7040372Z 2025-12-04T13:44:25.7040603Z [rank3]:[W1204 13:24:01.885096256 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7040774Z [rank1]:[W1204 13:24:02.505843901 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7040948Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7041214Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7041377Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7041757Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7041960Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7042085Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7042181Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7042276Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7042278Z 2025-12-04T13:44:25.7042509Z [rank1]:[W1204 13:24:02.507510944 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7042677Z [rank2]:[W1204 13:24:02.881058269 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7042852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7043107Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7043269Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7043635Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7043837Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7043944Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7044039Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7044136Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7044137Z 2025-12-04T13:44:25.7044370Z [rank2]:[W1204 13:24:02.883264390 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7044539Z [rank3]:[W1204 13:24:02.885194767 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7044713Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7044969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7045142Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7045507Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7045720Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7045824Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7045941Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7046038Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7046041Z 2025-12-04T13:44:25.7046273Z [rank3]:[W1204 13:24:02.887565464 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7046443Z [rank1]:[W1204 13:24:03.507650005 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7046616Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7046870Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7047036Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7047405Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7047657Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7047761Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7047859Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7047954Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7047956Z 2025-12-04T13:44:25.7048192Z [rank1]:[W1204 13:24:03.508892497 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7048362Z [rank2]:[W1204 13:24:03.883445100 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7048536Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7048790Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7048955Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7049343Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7049545Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7049662Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7049759Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7049883Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7049884Z 2025-12-04T13:44:25.7050116Z [rank2]:[W1204 13:24:03.885663950 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7050287Z [rank3]:[W1204 13:24:03.887711345 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7050462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7050717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7050881Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7051246Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7051446Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7051550Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7051647Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7051744Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7051746Z 2025-12-04T13:44:25.7051978Z [rank3]:[W1204 13:24:03.889997564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7052147Z [rank1]:[W1204 13:24:04.509048258 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7052321Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7052578Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7052743Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7053121Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7053325Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7053428Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7053537Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7053633Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7053655Z 2025-12-04T13:44:25.7053888Z [rank1]:[W1204 13:24:04.510282050 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7054057Z [rank2]:[W1204 13:24:04.885792922 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7054230Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7054483Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7054645Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7055021Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7055220Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7055325Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7055420Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7055518Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7055521Z 2025-12-04T13:44:25.7055755Z [rank2]:[W1204 13:24:04.887745658 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7055926Z [rank3]:[W1204 13:24:04.890135115 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7056100Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7056354Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7056517Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7056895Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7057099Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7057203Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7057300Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7057406Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7057409Z 2025-12-04T13:44:25.7057685Z [rank3]:[W1204 13:24:04.892604990 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7057882Z [rank1]:[W1204 13:24:05.510438491 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7058055Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7058308Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7058470Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7058836Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7059039Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7059143Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7059240Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7059337Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7059340Z 2025-12-04T13:44:25.7059575Z [rank1]:[W1204 13:24:05.511682274 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7059746Z [rank2]:[W1204 13:24:05.887869930 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7059921Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7060175Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7060339Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7060706Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7060926Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7061032Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7061126Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7061222Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7061224Z 2025-12-04T13:44:25.7061475Z [rank2]:[W1204 13:24:05.890316756 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7061678Z [rank3]:[W1204 13:24:05.892702433 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7061853Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7062106Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7062268Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7062633Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7062837Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7062942Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7063037Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7063133Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7063135Z 2025-12-04T13:44:25.7063368Z [rank3]:[W1204 13:24:05.894764167 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7063541Z [rank1]:[W1204 13:24:06.511854055 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7063716Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7063970Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7064132Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7064502Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7064716Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7064819Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7064915Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7065012Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7065014Z 2025-12-04T13:44:25.7065260Z [rank1]:[W1204 13:24:06.513427010 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7065430Z [rank2]:[W1204 13:24:06.890501207 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7065624Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7065883Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7066045Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7066411Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7066612Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7066719Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7066813Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7066911Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7066913Z 2025-12-04T13:44:25.7067145Z [rank2]:[W1204 13:24:06.892474364 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7067315Z [rank3]:[W1204 13:24:06.894920449 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7067531Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7067786Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7067951Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7068317Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7068519Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7068623Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7068732Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7068829Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7068830Z 2025-12-04T13:44:25.7069062Z [rank3]:[W1204 13:24:06.896885296 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7069244Z [rank1]:[W1204 13:24:07.513565693 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7069417Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7069696Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7069857Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7070230Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7070432Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7070538Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7070634Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7070729Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7070731Z 2025-12-04T13:44:25.7070965Z [rank1]:[W1204 13:24:07.514805515 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7071133Z [rank2]:[W1204 13:24:07.892611167 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7071308Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7071565Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7071727Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7072093Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7072294Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7072402Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7072498Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7072606Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7072608Z 2025-12-04T13:44:25.7072840Z [rank2]:[W1204 13:24:07.894542584 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7073009Z [rank3]:[W1204 13:24:07.897010479 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7073194Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7073462Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7073637Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7074001Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7074204Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7074308Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7074404Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7074501Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7074503Z 2025-12-04T13:44:25.7074736Z [rank3]:[W1204 13:24:07.899125942 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7074906Z [rank1]:[W1204 13:24:08.514972168 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7075079Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7075333Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7075496Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7075864Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7076069Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7076172Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7076270Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7076365Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7076367Z 2025-12-04T13:44:25.7076614Z [rank1]:[W1204 13:24:08.516377317 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7076785Z [rank2]:[W1204 13:24:08.894665958 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7076958Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7077222Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7077411Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7077802Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7078003Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7078108Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7078202Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7078302Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7078303Z 2025-12-04T13:44:25.7078537Z [rank2]:[W1204 13:24:08.896576526 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7078706Z [rank3]:[W1204 13:24:08.899235567 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7078880Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7079135Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7079299Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7079666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7079867Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7079973Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7080068Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7080165Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7080168Z 2025-12-04T13:44:25.7080416Z [rank3]:[W1204 13:24:08.901425368 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7080586Z [rank1]:[W1204 13:24:09.516549411 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7080759Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7081030Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7081194Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7081588Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7081790Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7081893Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7081990Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7082084Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7082088Z 2025-12-04T13:44:25.7082321Z [rank1]:[W1204 13:24:09.517812332 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7082489Z [rank2]:[W1204 13:24:09.896706920 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7082663Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7082917Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7083079Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7083452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7083652Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7083756Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7083851Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7083948Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7083950Z 2025-12-04T13:44:25.7084184Z [rank2]:[W1204 13:24:09.898948760 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7084370Z [rank3]:[W1204 13:24:09.901517423 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7084544Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7084799Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7084970Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7085360Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7085562Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7085667Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7085762Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7085859Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7085861Z 2025-12-04T13:44:25.7086093Z [rank3]:[W1204 13:24:09.903439941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7086266Z [rank1]:[W1204 13:24:10.518017486 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7086439Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7086692Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7086854Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7087219Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7087422Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7087567Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7091883Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7091987Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7091992Z 2025-12-04T13:44:25.7092227Z [rank1]:[W1204 13:24:10.519476983 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7092399Z [rank2]:[W1204 13:24:10.899370079 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7092615Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7092872Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7093053Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7093421Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7093656Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7093762Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7093857Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7093956Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7093957Z 2025-12-04T13:44:25.7094194Z [rank2]:[W1204 13:24:10.901541941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7094365Z [rank3]:[W1204 13:24:10.903564146 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7094540Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7094793Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7094960Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7095325Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7095531Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7095636Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7095730Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7095826Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7095828Z 2025-12-04T13:44:25.7096061Z [rank3]:[W1204 13:24:10.905541482 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7096234Z [rank1]:[W1204 13:24:11.519655798 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7096421Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7096675Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7096837Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7097216Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7097439Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7097573Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7097669Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7097764Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7097766Z 2025-12-04T13:44:25.7098001Z [rank1]:[W1204 13:24:11.520902440 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7098170Z [rank2]:[W1204 13:24:11.901694216 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7098347Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7098603Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7098768Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7099135Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7099337Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7099442Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7099537Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7099635Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7099636Z 2025-12-04T13:44:25.7099870Z [rank2]:[W1204 13:24:11.903551235 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7100040Z [rank3]:[W1204 13:24:11.905676598 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7100216Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7100488Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7100652Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7101041Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7101243Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7101373Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7101470Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7101567Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7101569Z 2025-12-04T13:44:25.7101804Z [rank3]:[W1204 13:24:11.907908708 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7101974Z [rank1]:[W1204 13:24:12.521330590 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7102147Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7102403Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7102565Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7102935Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7103136Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7103241Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7103337Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7103432Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7103434Z 2025-12-04T13:44:25.7103666Z [rank1]:[W1204 13:24:12.523473882 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7103836Z [rank2]:[W1204 13:24:12.903683041 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7104010Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7104277Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7104439Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7104804Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7105016Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7105140Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7105235Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7105333Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7105334Z 2025-12-04T13:44:25.7105567Z [rank2]:[W1204 13:24:12.905778395 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7105735Z [rank3]:[W1204 13:24:12.908011105 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7105910Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7106164Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7106329Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7106693Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7106895Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7106999Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7107095Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7107193Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7107195Z 2025-12-04T13:44:25.7107426Z [rank3]:[W1204 13:24:12.909993481 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7107639Z [rank1]:[W1204 13:24:13.523625709 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7107812Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7108066Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7108250Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7108618Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7108832Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7108935Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7109055Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7109150Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7109152Z 2025-12-04T13:44:25.7109387Z [rank1]:[W1204 13:24:13.524867051 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7109556Z [rank2]:[W1204 13:24:13.905917452 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7109730Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7109985Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7110148Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7110514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7110714Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7110820Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7110915Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7111013Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7111015Z 2025-12-04T13:44:25.7111248Z [rank2]:[W1204 13:24:13.907893388 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7111418Z [rank3]:[W1204 13:24:13.910087029 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7111593Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7111850Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7112013Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7112395Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7112595Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7112709Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7112803Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7112899Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7112925Z 2025-12-04T13:44:25.7113157Z [rank3]:[W1204 13:24:13.912131484 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7113326Z [rank1]:[W1204 13:24:14.525021648 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7113498Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7113756Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7113920Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7114286Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7114485Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7114589Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7114684Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7114779Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7114783Z 2025-12-04T13:44:25.7115016Z [rank1]:[W1204 13:24:14.526259981 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7115186Z [rank2]:[W1204 13:24:14.908047816 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7115358Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7115613Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7115775Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7116157Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7116358Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7116462Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7116557Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7116665Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7116667Z 2025-12-04T13:44:25.7116909Z [rank2]:[W1204 13:24:14.910154439 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7117089Z [rank3]:[W1204 13:24:14.912210363 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7117263Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7117561Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7117725Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7118097Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7118299Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7118403Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7118498Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7118594Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7118596Z 2025-12-04T13:44:25.7118827Z [rank3]:[W1204 13:24:14.914231968 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7119000Z [rank1]:[W1204 13:24:15.526411089 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7119174Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7119427Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7119589Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7119955Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7120174Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7120278Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7120373Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7120468Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7120482Z 2025-12-04T13:44:25.7120717Z [rank1]:[W1204 13:24:15.527648631 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7120915Z [rank2]:[W1204 13:24:15.910324916 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7121088Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7121342Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7121503Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7121870Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7122072Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7122176Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7122271Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7122368Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7122370Z 2025-12-04T13:44:25.7122604Z [rank2]:[W1204 13:24:15.912480278 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7122775Z [rank3]:[W1204 13:24:15.914352287 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7122949Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7123203Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7123365Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7123730Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7123942Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7124047Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7124140Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7124236Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7124238Z 2025-12-04T13:44:25.7124484Z [rank3]:[W1204 13:24:15.916441811 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7124657Z [rank1]:[W1204 13:24:16.527801839 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7124856Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7125113Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7125276Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7125641Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7125844Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7125948Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7126042Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7126138Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7126139Z 2025-12-04T13:44:25.7126373Z [rank1]:[W1204 13:24:16.529435833 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7126544Z [rank2]:[W1204 13:24:16.912599048 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7126720Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7126978Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7127140Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7127551Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7127752Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7127871Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7127967Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7128062Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7128064Z 2025-12-04T13:44:25.7128296Z [rank2]:[W1204 13:24:16.914613773 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7128477Z [rank3]:[W1204 13:24:16.916546270 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7128684Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7128942Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7129107Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7129472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7129673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7129778Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7129873Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7129969Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7129970Z 2025-12-04T13:44:25.7130203Z [rank3]:[W1204 13:24:16.918505257 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7130373Z [rank1]:[W1204 13:24:17.529581992 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7130546Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7130801Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7130963Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7131331Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7131531Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7131635Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7131745Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7131840Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7131842Z 2025-12-04T13:44:25.7132074Z [rank1]:[W1204 13:24:17.531059490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7132256Z [rank2]:[W1204 13:24:17.914747553 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7132430Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7132706Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7132869Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7133236Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7133438Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7133544Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7133639Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7133736Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7133738Z 2025-12-04T13:44:25.7133972Z [rank2]:[W1204 13:24:17.916849646 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7134140Z [rank3]:[W1204 13:24:17.918618157 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7134314Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7134569Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7134734Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7135102Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7135301Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7135405Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7135501Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7135607Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7135609Z 2025-12-04T13:44:25.7135840Z [rank3]:[W1204 13:24:17.921243948 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7136009Z [rank1]:[W1204 13:24:18.531234539 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7136192Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7136447Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7136637Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7137001Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7137203Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7137306Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7137403Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7137541Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7137543Z 2025-12-04T13:44:25.7137780Z [rank1]:[W1204 13:24:18.533090178 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7137951Z [rank2]:[W1204 13:24:18.917027906 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7138124Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7138380Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7138543Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7138909Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7139109Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7139214Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7139310Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7139407Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7139409Z 2025-12-04T13:44:25.7139660Z [rank2]:[W1204 13:24:18.918987692 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7139831Z [rank3]:[W1204 13:24:18.921368129 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7140006Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7140273Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7140469Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7140836Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7141037Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7141140Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7141235Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7141331Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7141335Z 2025-12-04T13:44:25.7141566Z [rank3]:[W1204 13:24:18.923379395 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7141736Z [rank1]:[W1204 13:24:19.533237108 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7141908Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7142164Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7142326Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7142692Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7142892Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7142994Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7143090Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7143185Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7143189Z 2025-12-04T13:44:25.7143422Z [rank1]:[W1204 13:24:19.534464191 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7143601Z [rank2]:[W1204 13:24:19.919062824 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7143774Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7144038Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7144201Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7144592Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7144793Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7144896Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7144991Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7145088Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7145090Z 2025-12-04T13:44:25.7145324Z [rank2]:[W1204 13:24:19.921101519 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7145495Z [rank3]:[W1204 13:24:19.923474536 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7145669Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7145921Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7146083Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7146452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7146653Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7146758Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7146852Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7146949Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7146951Z 2025-12-04T13:44:25.7147182Z [rank3]:[W1204 13:24:19.925640608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7147363Z [rank1]:[W1204 13:24:20.534586993 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7147570Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7147823Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7148001Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7148366Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7148596Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7148701Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7148796Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7148890Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7148894Z 2025-12-04T13:44:25.7149126Z [rank1]:[W1204 13:24:20.535785936 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7149297Z [rank2]:[W1204 13:24:20.921279320 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7149471Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7149725Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7149886Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7150256Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7150460Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7150564Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7150660Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7150756Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7150758Z 2025-12-04T13:44:25.7150993Z [rank2]:[W1204 13:24:20.923521820 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7151163Z [rank3]:[W1204 13:24:20.925766090 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7151349Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7151602Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7151765Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7152140Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7152362Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7152467Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7152560Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7152655Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7152657Z 2025-12-04T13:44:25.7152894Z [rank3]:[W1204 13:24:20.926931544 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7153066Z [rank1]:[W1204 13:24:21.535933578 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7153240Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7153494Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7153660Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7154024Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7154225Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7154329Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7154424Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7154519Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7154522Z 2025-12-04T13:44:25.7154756Z [rank1]:[W1204 13:24:21.537268818 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7154926Z [rank2]:[W1204 13:24:21.923715351 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7155103Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7155368Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7155531Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7155915Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7156141Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7156245Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7156343Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7156438Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7156440Z 2025-12-04T13:44:25.7156673Z [rank2]:[W1204 13:24:21.925491061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7156841Z [rank3]:[W1204 13:24:21.927065956 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7157016Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7157271Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7157435Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7157848Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7158049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7158155Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7158250Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7158346Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7158348Z 2025-12-04T13:44:25.7158579Z [rank3]:[W1204 13:24:21.928800658 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7158749Z [rank1]:[W1204 13:24:22.537400341 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7158923Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7159190Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7159353Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7159732Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7159933Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7160064Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7160160Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7160255Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7160257Z 2025-12-04T13:44:25.7160491Z [rank1]:[W1204 13:24:22.538616404 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7160661Z [rank2]:[W1204 13:24:22.925653194 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7160834Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7161090Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7161253Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7161620Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7161826Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7161930Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7162026Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7162123Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7162125Z 2025-12-04T13:44:25.7162359Z [rank2]:[W1204 13:24:22.927664439 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7162527Z [rank3]:[W1204 13:24:22.929029329 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7162701Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7162955Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7163129Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7163495Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7163707Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7163813Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7163930Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7164027Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7164030Z 2025-12-04T13:44:25.7164264Z [rank3]:[W1204 13:24:22.931033584 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7164435Z [rank1]:[W1204 13:24:23.538752427 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7164609Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7164861Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7165025Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7165389Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7165591Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7165695Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7165791Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7165888Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7165889Z 2025-12-04T13:44:25.7166128Z [rank1]:[W1204 13:24:23.539975310 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7166300Z [rank2]:[W1204 13:24:23.927826662 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7166472Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7166726Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7166889Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7167265Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7167467Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7167620Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7167715Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7167836Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7167838Z 2025-12-04T13:44:25.7168073Z [rank2]:[W1204 13:24:23.929683831 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7168243Z [rank3]:[W1204 13:24:23.931131119 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7168418Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7168675Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7168838Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7169204Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7169404Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7169509Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7169604Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7169700Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7169704Z 2025-12-04T13:44:25.7169935Z [rank3]:[W1204 13:24:23.932669724 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7170104Z [rank1]:[W1204 13:24:24.540078374 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7170277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7170534Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7170697Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7171082Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7171282Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7171386Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7171492Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7171588Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7171613Z 2025-12-04T13:44:25.7171845Z [rank1]:[W1204 13:24:24.541303337 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7172015Z [rank2]:[W1204 13:24:24.929953342 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7172188Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7172442Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7172605Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7172974Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7173177Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7173282Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7173378Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7173474Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7173476Z 2025-12-04T13:44:25.7173710Z [rank2]:[W1204 13:24:24.932347949 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7173882Z [rank3]:[W1204 13:24:24.932807409 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7174056Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7174313Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7174477Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7174855Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7175057Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7175161Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7175254Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7175360Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7175362Z 2025-12-04T13:44:25.7175594Z [rank3]:[W1204 13:24:24.934816544 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7175787Z [rank1]:[W1204 13:24:25.541454371 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7175962Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7176214Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7176379Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7176743Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7176951Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7177055Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7177150Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7177245Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7177247Z 2025-12-04T13:44:25.7177527Z [rank1]:[W1204 13:24:25.542767252 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7177700Z [rank2]:[W1204 13:24:25.932504803 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7177874Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7178129Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7178290Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7178656Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7178874Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7178977Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7179073Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7179168Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7179170Z 2025-12-04T13:44:25.7179416Z [rank2]:[W1204 13:24:25.933924351 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7179616Z [rank3]:[W1204 13:24:25.934948629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7179790Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7180044Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7180205Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7180570Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7180773Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7180877Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7180971Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7181066Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7181068Z 2025-12-04T13:44:25.7181304Z [rank3]:[W1204 13:24:26.937039092 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7181667Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T13:44:25.7181708Z warnings.warn( 2025-12-04T13:44:25.7181879Z [rank1]:[W1204 13:24:26.542906637 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7182052Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7182306Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7182469Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7182843Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7183044Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7183148Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7183242Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7183347Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7183350Z 2025-12-04T13:44:25.7183586Z [rank1]:[W1204 13:24:26.544257507 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7183788Z [rank2]:[W1204 13:24:26.934040477 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.7183961Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7184214Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7184378Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7184744Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7184946Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7185049Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7185145Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7185242Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7185243Z 2025-12-04T13:44:25.7185478Z [rank2]:[W1204 13:24:26.935954385 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7185650Z [rank3]:[W1204 13:24:27.937177128 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7185823Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7186078Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7186238Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7186603Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7186817Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7186922Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7187016Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7187112Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7187114Z 2025-12-04T13:44:25.7187357Z [rank3]:[W1204 13:24:27.939172203 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7187577Z [rank1]:[W1204 13:24:27.544402243 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7187754Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7188012Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7188175Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7188542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7188746Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7188850Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7188944Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7189040Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7189042Z 2025-12-04T13:44:25.7189273Z [rank1]:[W1204 13:24:27.545621976 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7189445Z [rank2]:[W1204 13:24:27.936077171 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7189618Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7189873Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7190035Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7190403Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7190621Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7190724Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7190820Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7190916Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7190917Z 2025-12-04T13:44:25.7191162Z [rank2]:[W1204 13:24:28.938052497 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7191332Z [rank3]:[W1204 13:24:28.939276570 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7191534Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7191790Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7191951Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7192318Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7192521Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7192627Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7192722Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7192817Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7192819Z 2025-12-04T13:44:25.7193052Z [rank3]:[W1204 13:24:28.940934843 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7193220Z [rank1]:[W1204 13:24:28.545749262 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7193396Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7193649Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7193810Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7194178Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7194379Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7194494Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7194590Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7194686Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7194688Z 2025-12-04T13:44:25.7194920Z [rank1]:[W1204 13:24:28.546980725 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7195100Z [rank2]:[W1204 13:24:29.938150615 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7195284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7195550Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7195713Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7196080Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7196283Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7196388Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7196484Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7196580Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7196582Z 2025-12-04T13:44:25.7196818Z [rank2]:[W1204 13:24:29.939384987 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7196989Z [rank3]:[W1204 13:24:29.941045341 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7197162Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7197421Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7197618Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7197986Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7198187Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7198292Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7198403Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7198498Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7198500Z 2025-12-04T13:44:25.7198732Z [rank3]:[W1204 13:24:29.942612676 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7198915Z [rank1]:[W1204 13:24:29.547114092 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7199091Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7199377Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7199540Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7199905Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7200106Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7200212Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7200307Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7200403Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7200405Z 2025-12-04T13:44:25.7200636Z [rank1]:[W1204 13:24:29.548610099 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7200806Z [rank2]:[W1204 13:24:30.939502055 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7200979Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7201240Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7201407Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7201772Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7201974Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7202077Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7202174Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7202270Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7202281Z 2025-12-04T13:44:25.7202513Z [rank2]:[W1204 13:24:30.941217807 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7202683Z [rank3]:[W1204 13:24:30.942704614 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7202870Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7203127Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7203313Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7203681Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7203881Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7203985Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7204081Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7204179Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7204181Z 2025-12-04T13:44:25.7204415Z [rank3]:[W1204 13:24:30.943829589 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7204583Z [rank1]:[W1204 13:24:30.548704067 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7204757Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7205011Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7205176Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7205548Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7205749Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7205853Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7205948Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7206046Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7206048Z 2025-12-04T13:44:25.7206292Z [rank1]:[W1204 13:24:30.549913001 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7206462Z [rank2]:[W1204 13:24:31.941401024 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7206634Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7206897Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7207082Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7207446Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7207742Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7207846Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7207944Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7208040Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7208043Z 2025-12-04T13:44:25.7208279Z [rank2]:[W1204 13:24:31.943467438 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7208448Z [rank3]:[W1204 13:24:31.943949337 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7208620Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7208876Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7209036Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7209404Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7209605Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7209709Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7209805Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7209900Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7209904Z 2025-12-04T13:44:25.7210137Z [rank3]:[W1204 13:24:31.945841395 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7210320Z [rank1]:[W1204 13:24:31.550057339 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7210495Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7210761Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7210923Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7211320Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7211521Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7211627Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7211721Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7211819Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7211821Z 2025-12-04T13:44:25.7212057Z [rank1]:[W1204 13:24:31.551282871 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7212231Z [rank2]:[W1204 13:24:32.943637236 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7212406Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7212663Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7212826Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7213192Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7213395Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7213498Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7213594Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7213692Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7213695Z 2025-12-04T13:44:25.7213927Z [rank2]:[W1204 13:24:32.945315759 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7214107Z [rank3]:[W1204 13:24:32.946034403 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7214282Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7214538Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7214715Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7215081Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7215306Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7215410Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7215506Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7215601Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7215603Z 2025-12-04T13:44:25.7215838Z [rank3]:[W1204 13:24:32.947215116 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7216009Z [rank1]:[W1204 13:24:32.551421580 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7216184Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7216440Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7216603Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7216968Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7217170Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7217275Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7217368Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7217465Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7217466Z 2025-12-04T13:44:25.7217747Z [rank1]:[W1204 13:24:32.552659903 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7217919Z [rank2]:[W1204 13:24:33.945506656 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7218106Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7218359Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7218522Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7218907Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7219134Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7219237Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7219333Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7219429Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7219433Z 2025-12-04T13:44:25.7219664Z [rank2]:[W1204 13:24:33.947106731 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7219835Z [rank3]:[W1204 13:24:33.947353135 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7220011Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7220267Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7220430Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7220799Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7221005Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7221109Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7221204Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7221299Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7221301Z 2025-12-04T13:44:25.7221534Z [rank3]:[W1204 13:24:33.948830333 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7221704Z [rank1]:[W1204 13:24:33.552791533 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7221879Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7222143Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7222307Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7222681Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7222890Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7223007Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7223102Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7223198Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7223200Z 2025-12-04T13:44:25.7223434Z [rank1]:[W1204 13:24:33.554019845 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7223605Z [rank2]:[W1204 13:24:34.947353728 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7223780Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7224036Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7224200Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7224566Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7224768Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7224873Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7224970Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7225066Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7225069Z 2025-12-04T13:44:25.7225302Z [rank2]:[W1204 13:24:34.949605228 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7225474Z [rank3]:[W1204 13:24:34.948923873 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7225646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7225914Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7226076Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7226442Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7226655Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7226780Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7226876Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7226973Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7226975Z 2025-12-04T13:44:25.7227209Z [rank3]:[W1204 13:24:34.950325272 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7227378Z [rank1]:[W1204 13:24:34.554155725 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7227584Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7227839Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7228004Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7228371Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7228574Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7228679Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7228775Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7228871Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7228873Z 2025-12-04T13:44:25.7229106Z [rank1]:[W1204 13:24:34.555445707 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7229275Z [rank2]:[W1204 13:24:35.949777668 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7229450Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7229707Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7229892Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7230259Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7230477Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7230581Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7230704Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7230801Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7230804Z 2025-12-04T13:44:25.7231036Z [rank2]:[W1204 13:24:35.952019328 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7231206Z [rank3]:[W1204 13:24:35.950443883 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7231380Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7231636Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7231802Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7232173Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7232375Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7232479Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7232574Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7232670Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7232672Z 2025-12-04T13:44:25.7232905Z [rank3]:[W1204 13:24:35.952625595 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7233262Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T13:44:25.7233301Z warnings.warn( 2025-12-04T13:44:25.7233471Z [rank1]:[W1204 13:24:35.555536079 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7233646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7233916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7234079Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7234457Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7234658Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7234785Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7234880Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7234976Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7234978Z 2025-12-04T13:44:25.7235212Z [rank1]:[W1204 13:24:35.556724272 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7235381Z [rank2]:[W1204 13:24:36.952168679 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.7235554Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7235811Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7235973Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7236342Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7236544Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7236650Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7236744Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7236842Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7236843Z 2025-12-04T13:44:25.7237075Z [rank2]:[W1204 13:24:36.953372682 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7237244Z [rank3]:[W1204 13:24:36.952801105 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7237417Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7237714Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7237891Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7238259Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7238473Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7238577Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7238698Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7238793Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7238795Z 2025-12-04T13:44:25.7239027Z [rank3]:[W1204 13:24:36.954151875 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7239196Z [rank1]:[W1204 13:24:36.556883103 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7239370Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7239625Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7239788Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7240152Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7240353Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7240457Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7240553Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7240649Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7240653Z 2025-12-04T13:44:25.7240886Z [rank1]:[W1204 13:24:36.558138555 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7241054Z [rank2]:[W1204 13:24:37.953522813 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7241228Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7241480Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7241655Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7242020Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7242222Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7242339Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7242435Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7242560Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7242562Z 2025-12-04T13:44:25.7242798Z [rank2]:[W1204 13:24:37.954765916 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7242969Z [rank3]:[W1204 13:24:37.954302216 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7243142Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7243398Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7243563Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7243927Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7244127Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7244230Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7244326Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7244422Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7244424Z 2025-12-04T13:44:25.7244657Z [rank3]:[W1204 13:24:37.955602427 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7244830Z [rank1]:[W1204 13:24:37.558304837 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7245005Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7245261Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7245422Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7245799Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7245999Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7246103Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7246207Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7246305Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7246327Z 2025-12-04T13:44:25.7246560Z [rank1]:[W1204 13:24:37.559536319 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7246729Z [rank2]:[W1204 13:24:38.954924698 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7246902Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7247164Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7247328Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7247740Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7247941Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7248045Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7248140Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7248237Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7248240Z 2025-12-04T13:44:25.7248473Z [rank2]:[W1204 13:24:38.956209579 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7248644Z [rank3]:[W1204 13:24:38.955772909 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7248817Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7249074Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7249237Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7249623Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7249824Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7249927Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7250022Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7250129Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7250131Z 2025-12-04T13:44:25.7250364Z [rank3]:[W1204 13:24:38.957023151 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7250566Z [rank1]:[W1204 13:24:38.559699921 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7250738Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7250992Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7251154Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7251526Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7251729Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7251833Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7251928Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7252026Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7252028Z 2025-12-04T13:44:25.7252260Z [rank1]:[W1204 13:24:38.561694967 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7252430Z [rank2]:[W1204 13:24:39.956329842 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7252604Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7252857Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7253021Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7253384Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7253599Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7253703Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7253797Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7253894Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7253896Z 2025-12-04T13:44:25.7254137Z [rank2]:[W1204 13:24:39.958323918 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7254332Z [rank3]:[W1204 13:24:39.957201943 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7254506Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7254761Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7254923Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7255288Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7255492Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7255594Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7255690Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7255785Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7255787Z 2025-12-04T13:44:25.7256022Z [rank3]:[W1204 13:24:39.959309456 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7256194Z [rank1]:[W1204 13:24:39.561950617 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7256368Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7256622Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7256783Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7257149Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7257362Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7257467Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7257600Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7257695Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7257697Z 2025-12-04T13:44:25.7257947Z [rank1]:[W1204 13:24:39.564443102 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7258118Z [rank2]:[W1204 13:24:40.958481391 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7258330Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7258584Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7258746Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7259116Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7259318Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7259423Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7259517Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7259613Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7259615Z 2025-12-04T13:44:25.7259847Z [rank2]:[W1204 13:24:40.960624563 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7260017Z [rank3]:[W1204 13:24:40.959479549 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7260193Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7260450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7260612Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7260976Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7261177Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7261295Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7261390Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7261485Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7261486Z 2025-12-04T13:44:25.7261719Z [rank3]:[W1204 13:24:40.961536893 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7261900Z [rank1]:[W1204 13:24:40.564605705 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7262073Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7262355Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7262519Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7262887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7263088Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7263194Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7263289Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7263384Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7263386Z 2025-12-04T13:44:25.7263618Z [rank1]:[W1204 13:24:40.566863155 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7263787Z [rank2]:[W1204 13:24:41.960759958 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7263961Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7264217Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7264379Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7264751Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7264952Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7265057Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7265152Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7265260Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7265262Z 2025-12-04T13:44:25.7265493Z [rank2]:[W1204 13:24:41.962957899 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7265662Z [rank3]:[W1204 13:24:41.961740596 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7265847Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7266127Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7266291Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7266654Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7266858Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7266961Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7267058Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7267154Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7267156Z 2025-12-04T13:44:25.7267389Z [rank3]:[W1204 13:24:41.963828929 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7267601Z [rank1]:[W1204 13:24:41.567049869 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7267776Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7268032Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7268196Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7268564Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7268764Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7268870Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7268967Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7269063Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7269065Z 2025-12-04T13:44:25.7269313Z [rank1]:[W1204 13:24:41.568397379 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7269482Z [rank2]:[W1204 13:24:42.963134923 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7269668Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7269923Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7270111Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7270478Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7270679Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7270784Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7270879Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7270978Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7270980Z 2025-12-04T13:44:25.7271218Z [rank2]:[W1204 13:24:42.965449701 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7271389Z [rank3]:[W1204 13:24:42.964013673 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7271562Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7271818Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7271982Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7272346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7272546Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7272650Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7272745Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7272841Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7272843Z 2025-12-04T13:44:25.7273087Z [rank3]:[W1204 13:24:42.966014769 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7273259Z [rank1]:[W1204 13:24:42.568571573 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7273433Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7273703Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7273866Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7274255Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7274456Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7274559Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7274655Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7274750Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7274753Z 2025-12-04T13:44:25.7274989Z [rank1]:[W1204 13:24:42.570320954 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7275157Z [rank2]:[W1204 13:24:43.965595016 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7275330Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7275587Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7275750Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7276117Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7276317Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7276421Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7276515Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7276613Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7276616Z 2025-12-04T13:44:25.7276848Z [rank2]:[W1204 13:24:43.967945054 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7277030Z [rank3]:[W1204 13:24:43.966186943 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7277205Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7277459Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7277677Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7278070Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7278271Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7278374Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7278469Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7278565Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7278568Z 2025-12-04T13:44:25.7278800Z [rank3]:[W1204 13:24:43.968265417 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7278973Z [rank1]:[W1204 13:24:43.570492579 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7279146Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7279403Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7279566Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7279933Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7280137Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7280240Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7280335Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7280430Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7280433Z 2025-12-04T13:44:25.7280666Z [rank1]:[W1204 13:24:43.572654561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7280712Z PASSED [286.9055s] [100%] 2025-12-04T13:44:25.7280715Z 
2025-12-04T13:44:25.7280982Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-c043a2bb54ab4c8d.xml - 2025-12-04T13:44:25.7281059Z ================= 1 passed, 61 deselected in 286.92s (0:04:46) ================= 2025-12-04T13:44:25.7281234Z [rank2]:[W1204 13:24:44.968127609 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7281407Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7281677Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7281865Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7282232Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7282434Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7282539Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7282634Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7282731Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7282733Z 2025-12-04T13:44:25.7282968Z [rank2]:[W1204 13:24:44.970333200 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore 
server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7283138Z [rank3]:[W1204 13:24:44.968378043 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7283312Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7283568Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7283733Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7284103Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7284304Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7284409Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7284504Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7284602Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7284605Z 2025-12-04T13:44:25.7284850Z [rank3]:[W1204 13:24:44.970465637 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7285021Z [rank1]:[W1204 13:24:44.572817567 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.7285197Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7285461Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7285626Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7286015Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7286217Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7286322Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7286417Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7286515Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7286518Z 2025-12-04T13:44:25.7286751Z [rank1]:[W1204 13:24:44.575216224 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7286789Z Got exit code 0 2025-12-04T13:44:25.7286872Z Test succeeded in new process, continuing with the rest of the tests 2025-12-04T13:44:25.7287045Z [rank3]:[W1204 13:24:45.970553415 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7287218Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7287515Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7287680Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7288046Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7288249Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7288354Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7288449Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7288546Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7288550Z 2025-12-04T13:44:25.7288803Z [rank3]:[W1204 13:24:45.971704599 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7288973Z [rank2]:[W1204 13:24:45.970518806 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7289146Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7289417Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7289581Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7289977Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7290180Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7290283Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7290379Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7290475Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7290478Z 2025-12-04T13:44:25.7290713Z [rank2]:[W1204 13:24:45.972006283 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7290882Z [rank1]:[W1204 13:24:45.575394739 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7291056Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7291311Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7291474Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7291843Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7292043Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7292147Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7292242Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7292338Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7292340Z 2025-12-04T13:44:25.7292573Z [rank1]:[W1204 13:24:45.577190590 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7292755Z [rank3]:[W1204 13:24:46.971893545 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7292932Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7293188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7293361Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7293747Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7293948Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7294052Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7294147Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7294243Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7294247Z 2025-12-04T13:44:25.7294478Z [rank3]:[W1204 13:24:46.974038668 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7294651Z [rank2]:[W1204 13:24:46.972200439 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7294824Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7295079Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7295242Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7295611Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7295813Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7295917Z frame #4: + 0xdc253 (0x78d42d848253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7296012Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7296107Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7296109Z 2025-12-04T13:44:25.7296343Z [rank2]:[W1204 13:24:46.974503717 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7296526Z [rank1]:[W1204 13:24:46.577346617 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7296700Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7296955Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7297130Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7297538Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7297766Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7297870Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7297963Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7298059Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.7298061Z 2025-12-04T13:44:25.7298294Z [rank1]:[W1204 13:24:46.579537478 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7298465Z [rank3]:[W1204 13:24:47.974250744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7298640Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7298894Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7299058Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7299426Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7299630Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7299733Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7299829Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7299924Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7299927Z 2025-12-04T13:44:25.7300160Z [rank3]:[W1204 13:24:47.975912747 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:25.7300330Z [rank2]:[W1204 13:24:47.974689974 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7300517Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7300774Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7300936Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7301316Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7301546Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7301651Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7301746Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7301841Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7301843Z 2025-12-04T13:44:25.7302077Z [rank2]:[W1204 13:24:47.976908735 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7302245Z [rank1]:[W1204 13:24:47.579703916 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7302422Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7302677Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7302839Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7303206Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7303408Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7303514Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7303608Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7303704Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7303705Z 2025-12-04T13:44:25.7303938Z [rank1]:[W1204 13:24:47.580958668 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7304109Z [rank3]:[W1204 13:24:48.976113604 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7304286Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7304552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7304716Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7305094Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7305296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7305421Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7305517Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7305614Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7305616Z 2025-12-04T13:44:25.7305851Z [rank3]:[W1204 13:24:48.978178028 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7306021Z [rank2]:[W1204 13:24:48.977078352 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7306196Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7306452Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7306614Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.7306981Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7307183Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7307288Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7307385Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7307528Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7307530Z 2025-12-04T13:44:25.7307768Z [rank2]:[W1204 13:24:48.978406663 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7307939Z [rank1]:[W1204 13:24:48.581134216 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7308114Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7308383Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7308545Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7308910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.7309124Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7309256Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7309350Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7309447Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7309449Z 2025-12-04T13:44:25.7309680Z [rank1]:[W1204 13:24:48.582375488 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7309851Z [rank2]:[W1204 13:24:49.978554652 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7310028Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7310283Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7310449Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7310815Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7311018Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7311121Z frame #4: + 0xdc253 
(0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7311219Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7311319Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7311321Z 2025-12-04T13:44:25.7311552Z [rank2]:[W1204 13:24:49.979787294 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7311722Z [rank3]:[W1204 13:24:49.978313137 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7311896Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7312150Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7312325Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7312693Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7312908Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7313012Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7313132Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7313227Z frame #6: + 0x1268c0 (0x719e15f168c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7313229Z 2025-12-04T13:44:25.7313461Z [rank3]:[W1204 13:24:49.980296863 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7313663Z Test results will be stored in test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-fe62af45fb6188fc.xml 2025-12-04T13:44:25.7313726Z ============================= test session starts ============================== 2025-12-04T13:44:25.7313839Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T13:44:25.7313884Z cachedir: .pytest_cache 2025-12-04T13:44:25.7314043Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T13:44:25.7314094Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T13:44:25.7314136Z configfile: pytest.ini 2025-12-04T13:44:25.7314302Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T13:44:25.7314380Z collecting ... collected 62 items / 26 deselected / 36 selected 2025-12-04T13:44:25.7314435Z stepcurrent: skipping 26 already run items. 
2025-12-04T13:44:25.7314481Z Running 36 items in this shard 2025-12-04T13:44:25.7314483Z 2025-12-04T13:44:25.7314739Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr I1204 13:24:49.531000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 68233 2025-12-04T13:44:25.7314893Z I1204 13:24:49.531000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 68234 2025-12-04T13:44:25.7315043Z I1204 13:24:49.532000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 68235 2025-12-04T13:44:25.7315194Z I1204 13:24:49.532000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 68236 2025-12-04T13:44:25.7315364Z [rank1]:[W1204 13:24:49.582499108 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7315540Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7315799Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7315963Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7316347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7316548Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7316656Z frame #4: + 0xdc253 
(0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7316762Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7316861Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7316886Z 2025-12-04T13:44:25.7317129Z [rank1]:[W1204 13:24:49.583698051 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7317299Z [rank2]:[W1204 13:24:50.979933714 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7317513Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7317766Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7317930Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7318299Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7318501Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7318604Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7318703Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7318803Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7318805Z 2025-12-04T13:44:25.7319039Z [rank2]:[W1204 13:24:50.982074647 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7319213Z [rank3]:[W1204 13:24:50.980466193 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7319386Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7319644Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7319805Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7320188Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7320391Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7320496Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7320593Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7320689Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7320703Z 2025-12-04T13:44:25.7320937Z [rank3]:[W1204 13:24:50.982155455 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7321139Z [rank1]:[W1204 13:24:50.583856321 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7321315Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7321569Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7321732Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7322099Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7322302Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7322408Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7322503Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7322601Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7322603Z 2025-12-04T13:44:25.7322836Z [rank1]:[W1204 13:24:50.585718700 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7323008Z [rank3]:[W1204 13:24:51.982337344 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.7323185Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7323440Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7323602Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7323967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7324182Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7324285Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7324382Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7324480Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7324482Z 2025-12-04T13:44:25.7324725Z [rank3]:[W1204 13:24:51.983586607 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7324909Z [rank2]:[W1204 13:24:51.982237196 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7325095Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7325354Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7325518Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7325887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7326092Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7326195Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7326292Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7326388Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7326390Z 2025-12-04T13:44:25.7326624Z [rank2]:[W1204 13:24:51.984242042 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7326794Z [rank1]:[W1204 13:24:51.585918039 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7326971Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7327226Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7327391Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7327804Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7328007Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7328129Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7328224Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7328324Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7328326Z 2025-12-04T13:44:25.7328559Z [rank1]:[W1204 13:24:51.588212568 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7328744Z [rank3]:[W1204 13:24:52.983709118 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7328947Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7329202Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7329364Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7329735Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7329940Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7330045Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7330141Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7330237Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7330239Z 2025-12-04T13:44:25.7330471Z [rank3]:[W1204 13:24:52.984972720 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7330642Z [rank2]:[W1204 13:24:52.984396242 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7330816Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7331074Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7331236Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7331605Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7331807Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7331912Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7332023Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7332120Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7332121Z 2025-12-04T13:44:25.7332353Z [rank2]:[W1204 13:24:52.985757132 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7332538Z [rank1]:[W1204 13:24:52.588378459 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7332712Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7332988Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7333151Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7333519Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7333721Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7333826Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7333921Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7334018Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7334019Z 2025-12-04T13:44:25.7334254Z [rank1]:[W1204 13:24:52.589626471 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7334424Z [rank3]:[W1204 13:24:53.985148810 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7334600Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7334856Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7335022Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7335387Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7335591Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7335694Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7335791Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7335897Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7335899Z 2025-12-04T13:44:25.7336131Z [rank3]:[W1204 13:24:53.986354853 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7336303Z [rank2]:[W1204 13:24:53.985930193 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7336487Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7336744Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7336928Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7337299Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7341350Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7341463Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7341565Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7341662Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7341664Z 2025-12-04T13:44:25.7341900Z [rank2]:[W1204 13:24:53.987741063 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7342070Z [rank1]:[W1204 13:24:53.589783302 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7342246Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7342511Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7342678Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7343047Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7343248Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7343353Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7343448Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7343546Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7343548Z 2025-12-04T13:44:25.7343808Z [rank1]:[W1204 13:24:53.591026845 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7343979Z [rank3]:[W1204 13:24:54.986562894 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7344154Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7344424Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7344616Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7344987Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7345187Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7345292Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7345389Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7345487Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7345491Z 2025-12-04T13:44:25.7345723Z [rank3]:[W1204 13:24:54.987938323 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7345893Z [rank2]:[W1204 13:24:54.987903854 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7346066Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7346322Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7346484Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7346858Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7347059Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7347162Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7347261Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7347357Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7347360Z 2025-12-04T13:44:25.7347654Z [rank2]:[W1204 13:24:54.989131437 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7347823Z [rank1]:[W1204 13:24:54.591190816 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7347998Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7348265Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7348427Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7348817Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7349018Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7349123Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7349219Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7349316Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7349317Z 2025-12-04T13:44:25.7349551Z [rank1]:[W1204 13:24:54.592411439 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7349722Z [rank3]:[W1204 13:24:55.988107905 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7349896Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7350150Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7350313Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7350679Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7350880Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7350985Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7351080Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7351177Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7351179Z 2025-12-04T13:44:25.7351411Z [rank3]:[W1204 13:24:55.989345018 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7351600Z [rank2]:[W1204 13:24:55.989280799 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7351775Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7352031Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7352204Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7352572Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7352794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7352897Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7352994Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7353090Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7353092Z 2025-12-04T13:44:25.7353325Z [rank2]:[W1204 13:24:55.990493882 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7353497Z [rank1]:[W1204 13:24:55.592575552 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7353672Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7353927Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7354089Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7354455Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7354657Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7354762Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7354857Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7354953Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7354955Z 2025-12-04T13:44:25.7355193Z [rank1]:[W1204 13:24:55.593813824 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7355364Z [rank3]:[W1204 13:24:56.989450891 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7355549Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7355804Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7355968Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7356352Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7356573Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7356676Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7356771Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7356867Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7356869Z 2025-12-04T13:44:25.7357101Z [rank3]:[W1204 13:24:56.990589496 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7357272Z [rank2]:[W1204 13:24:56.990620016 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7357450Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7357748Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7357912Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7358279Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7358483Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7358587Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7358683Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7358779Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7358781Z 2025-12-04T13:44:25.7359016Z [rank2]:[W1204 13:24:56.991850268 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7359186Z [rank1]:[W1204 13:24:56.593993637 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7359362Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7359637Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7359799Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7360176Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7360403Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7360508Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7360603Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7360699Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7360701Z 2025-12-04T13:44:25.7360934Z [rank1]:[W1204 13:24:56.595865945 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7361102Z [rank3]:[W1204 13:24:57.990759659 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7361278Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7361534Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7361698Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7362068Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7362268Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7362374Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7362469Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7362565Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7362567Z 2025-12-04T13:44:25.7362798Z [rank3]:[W1204 13:24:57.992620698 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7362968Z [rank2]:[W1204 13:24:57.991996352 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7363141Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7363412Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7363574Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7363955Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7364158Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7364289Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7364386Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7364482Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7364484Z 2025-12-04T13:44:25.7364718Z [rank2]:[W1204 13:24:57.993300933 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7364888Z [rank1]:[W1204 13:24:57.596028129 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7365064Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7365320Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7365481Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7365848Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7366049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7366155Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7366250Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7366347Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7366349Z 2025-12-04T13:44:25.7366581Z [rank1]:[W1204 13:24:57.598808377 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7366750Z [rank3]:[W1204 13:24:58.992795831 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7366926Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7367180Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7367355Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7367770Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7367993Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7368097Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7368225Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7368323Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7368324Z 2025-12-04T13:44:25.7368559Z [rank3]:[W1204 13:24:58.994124602 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7368730Z [rank2]:[W1204 13:24:58.993405498 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7368905Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7369160Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7369325Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7369693Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7369895Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7369998Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7370096Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7370191Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7370192Z 2025-12-04T13:44:25.7370429Z [rank2]:[W1204 13:24:58.994648400 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7370599Z [rank1]:[W1204 13:24:58.598943242 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7370773Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7371028Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7371190Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7371571Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7371771Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7371887Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7371982Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7372099Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7372101Z 2025-12-04T13:44:25.7372334Z [rank1]:[W1204 13:24:58.600669384 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7372542Z [rank1]:W1204 13:24:58.839000 68234 
site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.7372747Z [rank0]:W1204 13:24:58.843000 68233 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.7372918Z [rank3]:[W1204 13:24:59.994276527 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7373095Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7373350Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7373514Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7373882Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7374083Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7374188Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7374284Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7374381Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7374383Z 2025-12-04T13:44:25.7374615Z [rank3]:[W1204 13:24:59.995568558 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 
2025-12-04T13:44:25.7374787Z [rank2]:[W1204 13:24:59.994781135 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7374963Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7375230Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7375395Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7375771Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7375974Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7376099Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7376196Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7376292Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7376297Z 2025-12-04T13:44:25.7376528Z [rank2]:[W1204 13:24:59.995998878 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7376697Z [rank1]:[W1204 13:24:59.600822888 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7376871Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7377130Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7377293Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7377696Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7377899Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7378005Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7378100Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7378196Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7378197Z 2025-12-04T13:44:25.7378431Z [rank1]:[W1204 13:24:59.605473985 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7378598Z [rank3]:[W1204 13:25:00.995742503 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7378774Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7379028Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7379217Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7379587Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7379800Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7379904Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7380030Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7380127Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7380129Z 2025-12-04T13:44:25.7380359Z [rank3]:[W1204 13:25:00.996956816 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7380530Z [rank2]:[W1204 13:25:00.996166783 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7380707Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7380961Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7381126Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7381496Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7381704Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7381807Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7381905Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7382001Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7382004Z 2025-12-04T13:44:25.7382236Z [rank2]:[W1204 13:25:00.997410926 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7382405Z [rank1]:[W1204 13:25:00.605715779 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7382578Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7382832Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7382994Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7383371Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7383572Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7383689Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7383785Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7383909Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7383911Z 2025-12-04T13:44:25.7384145Z [rank1]:[W1204 13:25:00.607714345 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7384314Z [rank3]:[W1204 13:25:01.997154850 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7384488Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7384744Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7384907Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7385272Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7385472Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7385577Z frame #4: + 0xdc253 (0x719d06c48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7385673Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7385769Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7385772Z 2025-12-04T13:44:25.7386005Z [rank3]:[W1204 13:25:01.998726386 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7386175Z [rank2]:[W1204 13:25:01.997561191 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7386350Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7386604Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7386769Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7387149Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7387350Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7387453Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7387608Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7387704Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.7387732Z 2025-12-04T13:44:25.7387970Z [rank2]:[W1204 13:25:01.999459559 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7388140Z [rank1]:[W1204 13:25:01.607911180 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7388313Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7388568Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7388729Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7389097Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7389298Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7389401Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7389496Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7389592Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7389594Z 2025-12-04T13:44:25.7389828Z [rank1]:[W1204 13:25:01.610044802 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:25.7390001Z [rank2]:[W1204 13:25:02.999613686 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7390176Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7390431Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7390595Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7390975Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7391178Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7391282Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7391376Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7391482Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7391484Z 2025-12-04T13:44:25.7391716Z [rank2]:[W1204 13:25:02.000827329 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7391912Z [rank3]:[W1204 13:25:02.998942130 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7392087Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7392346Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7392508Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7392873Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7393076Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7393180Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7393275Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7393373Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7393375Z 2025-12-04T13:44:25.7393608Z [rank3]:[W1204 13:25:02.001171901 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7393781Z [rank1]:[W1204 13:25:02.610207339 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7393955Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7394210Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7394373Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7394740Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7394958Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7395061Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7395157Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7395252Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7395253Z 2025-12-04T13:44:25.7395505Z [rank1]:[W1204 13:25:02.611433522 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7395695Z [rank2]:[W1204 13:25:03.000963006 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7395869Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7396123Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7396285Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.7396658Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7396863Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7396967Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7397061Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7397157Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7397159Z 2025-12-04T13:44:25.7397392Z [rank2]:[W1204 13:25:03.003055280 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7397601Z [rank3]:[W1204 13:25:03.001374447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7397778Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7398033Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7398194Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7398560Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.7398763Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7398886Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7398982Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7399079Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7399081Z 2025-12-04T13:44:25.7399325Z [rank3]:[W1204 13:25:03.004131246 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7399496Z [rank1]:[W1204 13:25:03.611580099 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7399696Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7399951Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7400112Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7400479Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7400682Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7400785Z frame #4: + 0xdc253 
(0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7400882Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7400979Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7400981Z 2025-12-04T13:44:25.7401218Z [rank1]:[W1204 13:25:03.613098775 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7401387Z [rank2]:[W1204 13:25:04.003231707 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7401563Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7401818Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7401980Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7402345Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7402544Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7402650Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7402757Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7402854Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7402856Z 2025-12-04T13:44:25.7403093Z [rank2]:[W1204 13:25:04.004861670 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7403276Z [rank3]:[W1204 13:25:04.004303913 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7403451Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7403729Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7403892Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7404258Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7404458Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7404564Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7404660Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7404757Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7404759Z 2025-12-04T13:44:25.7404990Z [rank3]:[W1204 13:25:04.006241120 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7405160Z [rank1]:[W1204 13:25:04.613259833 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7405334Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7405594Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7405755Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7406121Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7406321Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7406426Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7406522Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7406629Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7406631Z 2025-12-04T13:44:25.7406863Z [rank1]:[W1204 13:25:04.614822518 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7407032Z [rank2]:[W1204 13:25:05.004978149 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.7407219Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7407510Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7407703Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7408071Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7408272Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7408376Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7408473Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7408569Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7408571Z 2025-12-04T13:44:25.7408804Z [rank2]:[W1204 13:25:05.006205922 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7408973Z [rank3]:[W1204 13:25:05.006442227 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7409147Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7409406Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7409570Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7409937Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7410139Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7410243Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7410339Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7410436Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7410438Z 2025-12-04T13:44:25.7410688Z [rank3]:[W1204 13:25:05.008492801 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7410860Z [rank1]:[W1204 13:25:05.614986797 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7411033Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7411303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7411487Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7411857Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7412058Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7412162Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7412258Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7412354Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7412356Z 2025-12-04T13:44:25.7412589Z [rank1]:[W1204 13:25:05.616431885 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7412758Z [rank2]:[W1204 13:25:06.006353381 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7412933Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7413188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7413352Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7413721Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7413921Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7414026Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7414122Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7414218Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7414221Z 2025-12-04T13:44:25.7414470Z [rank2]:[W1204 13:25:06.007647802 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7414640Z [rank3]:[W1204 13:25:06.008661980 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7414813Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7415081Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7415244Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7415633Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7415834Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7415939Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7416035Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7416130Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7416134Z 2025-12-04T13:44:25.7416365Z [rank3]:[W1204 13:25:06.010793002 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7416535Z [rank1]:[W1204 13:25:06.616604563 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7416708Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7416963Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7417124Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7417523Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7417722Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7417825Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7417923Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7418019Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7418021Z 2025-12-04T13:44:25.7418257Z [rank1]:[W1204 13:25:06.617857035 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7418453Z [rank2]:[W1204 13:25:07.007769032 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7418629Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7418883Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7419066Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7419433Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7419669Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7419773Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7419868Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7419965Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7419967Z 2025-12-04T13:44:25.7420202Z [rank2]:[W1204 13:25:07.009023594 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7420373Z [rank3]:[W1204 13:25:07.010973201 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7420548Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7420802Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7420965Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7421328Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7421532Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7421636Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7421730Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7421826Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7421828Z 2025-12-04T13:44:25.7422060Z [rank3]:[W1204 13:25:07.013133963 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7422231Z [rank1]:[W1204 13:25:07.618040794 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7422417Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7422676Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7422837Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7423212Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7423441Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7423544Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7423640Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7423735Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7423737Z 2025-12-04T13:44:25.7423970Z [rank1]:[W1204 13:25:07.620041300 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7424142Z [rank2]:[W1204 13:25:08.009182234 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7424317Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7424575Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7424738Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7425105Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7425307Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7425411Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7425506Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7425603Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7425605Z 2025-12-04T13:44:25.7425839Z [rank2]:[W1204 13:25:08.011515822 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7426008Z [rank3]:[W1204 13:25:08.013313983 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7426185Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7426449Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7426613Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7426993Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7427215Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7427320Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7427416Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7427557Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7427559Z 2025-12-04T13:44:25.7427794Z [rank3]:[W1204 13:25:08.014998735 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7427963Z [rank1]:[W1204 13:25:08.620174861 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7428138Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7428393Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7428556Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7428921Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7429124Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7429229Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7429325Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7429420Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7429422Z 2025-12-04T13:44:25.7429655Z [rank1]:[W1204 13:25:08.622195586 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7429825Z [rank2]:[W1204 13:25:09.011630774 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7430001Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7430279Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7430443Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7430824Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7431025Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7431165Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7431262Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7431359Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7431360Z 2025-12-04T13:44:25.7431592Z [rank2]:[W1204 13:25:09.014029250 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7431761Z [rank3]:[W1204 13:25:09.015138646 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7431937Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7432194Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7432356Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7432724Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7432925Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7433032Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7433127Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7433223Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7433225Z 2025-12-04T13:44:25.7433462Z [rank3]:[W1204 13:25:09.017007994 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7433633Z [rank1]:[W1204 13:25:09.622383626 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7433807Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7434064Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7434240Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7434604Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7434814Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7434929Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7435035Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7435131Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7435133Z 2025-12-04T13:44:25.7435367Z [rank1]:[W1204 13:25:09.624832652 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7435537Z [rank2]:[W1204 13:25:10.014158222 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7435714Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7435970Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7436136Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7436504Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7436705Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7436810Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7436905Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7437002Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7437005Z 2025-12-04T13:44:25.7437237Z [rank2]:[W1204 13:25:10.015366115 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7437405Z [rank3]:[W1204 13:25:10.017169025 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7437616Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7437874Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7438052Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7438418Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7438620Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7438736Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7438832Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7438963Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7438965Z 2025-12-04T13:44:25.7439197Z [rank3]:[W1204 13:25:10.018771170 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7439368Z [rank1]:[W1204 13:25:10.625007903 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7439540Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7439796Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7439961Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7440327Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7440528Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7440632Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7440728Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7440824Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7440826Z 2025-12-04T13:44:25.7441060Z [rank1]:[W1204 13:25:10.626883661 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7441230Z [rank2]:[W1204 13:25:11.015506627 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7441403Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7441658Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7441824Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7442209Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7442409Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7442513Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7442617Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7442715Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7442737Z 2025-12-04T13:44:25.7442970Z [rank2]:[W1204 13:25:11.017596801 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7443139Z [rank3]:[W1204 13:25:11.018959951 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7443314Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7443570Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7443733Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7444106Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7444310Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7444414Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7444509Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7444605Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7444608Z 2025-12-04T13:44:25.7444840Z [rank3]:[W1204 13:25:11.020566775 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7445008Z [rank1]:[W1204 13:25:11.627060212 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7445181Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7445436Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7445598Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7445977Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7446178Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7446281Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7446376Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7446484Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7446486Z 2025-12-04T13:44:25.7446719Z [rank1]:[W1204 13:25:11.629040998 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7446910Z [rank2]:[W1204 13:25:12.017757343 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7447084Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7447339Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7447534Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7447902Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7448106Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7448212Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7448309Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7448406Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7448408Z 2025-12-04T13:44:25.7448641Z [rank2]:[W1204 13:25:12.019798138 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7448813Z [rank3]:[W1204 13:25:12.020682448 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7448987Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7449243Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7449407Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7449772Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7449989Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7450094Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7450189Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7450286Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7450288Z 2025-12-04T13:44:25.7450534Z [rank3]:[W1204 13:25:12.021914281 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7450728Z [rank1]:[W1204 13:25:12.629211111 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7450902Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7451157Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7451319Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7451688Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7451892Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7451995Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7452093Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7452188Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7452190Z 2025-12-04T13:44:25.7452423Z [rank1]:[W1204 13:25:12.631082869 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7452593Z [rank2]:[W1204 13:25:13.020017519 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7452766Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7453022Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7453184Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7453550Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7453762Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7453867Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7453962Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7454058Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7454060Z 2025-12-04T13:44:25.7454315Z [rank2]:[W1204 13:25:13.022333298 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7454485Z [rank3]:[W1204 13:25:13.022030024 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7454679Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7454936Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7455098Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7455463Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7455664Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7455769Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7455863Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7455959Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7455961Z 2025-12-04T13:44:25.7456193Z [rank3]:[W1204 13:25:13.023722627 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7456363Z [rank1]:[W1204 13:25:13.631224632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7456537Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7456794Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7456955Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7457322Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7457565Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7457683Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7457779Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7457874Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7457876Z 2025-12-04T13:44:25.7458109Z [rank1]:[W1204 13:25:13.633384835 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7458291Z [rank2]:[W1204 13:25:14.022472941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7458479Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7458752Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7458915Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7459285Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7459485Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7459592Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7459689Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7459785Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7459787Z 2025-12-04T13:44:25.7460020Z [rank2]:[W1204 13:25:14.024721521 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7460191Z [rank3]:[W1204 13:25:14.023877910 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7460366Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7460622Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7460786Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7461152Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7461353Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7461459Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7461552Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7461658Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7461660Z 2025-12-04T13:44:25.7461892Z [rank3]:[W1204 13:25:14.026051002 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7462061Z [rank1]:[W1204 13:25:14.633572818 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7462244Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7462518Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7462687Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7463051Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7463252Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7463356Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7463452Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7463548Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7463550Z 2025-12-04T13:44:25.7463782Z [rank1]:[W1204 13:25:14.636019753 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7463951Z [rank2]:[W1204 13:25:15.024855496 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7464125Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7464380Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7464544Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7464913Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7465116Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7465220Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7465317Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7465413Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7465427Z 2025-12-04T13:44:25.7465660Z [rank2]:[W1204 13:25:15.026959049 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7465829Z [rank3]:[W1204 13:25:15.026167457 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7466015Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7466270Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7466455Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7466827Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7467028Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7467134Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7467228Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7467326Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7467328Z 2025-12-04T13:44:25.7467600Z [rank3]:[W1204 13:25:15.028329069 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7467772Z [rank1]:[W1204 13:25:15.636173647 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7467946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7468199Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7468363Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7468731Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7468934Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7469039Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7469135Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7469232Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7469235Z 2025-12-04T13:44:25.7469489Z [rank1]:[W1204 13:25:15.637395210 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7469659Z [rank2]:[W1204 13:25:16.027144623 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7469833Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7470101Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7470278Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7470662Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7470862Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7470967Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7471064Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7471160Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7471163Z 2025-12-04T13:44:25.7471401Z [rank2]:[W1204 13:25:16.028969912 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7471570Z [rank3]:[W1204 13:25:16.028462614 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7471748Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7472002Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7472165Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7472532Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7472731Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7472834Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7472930Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7473026Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7473029Z 2025-12-04T13:44:25.7473262Z [rank3]:[W1204 13:25:16.030590656 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7473445Z [rank1]:[W1204 13:25:16.637552965 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7473621Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7473876Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7474049Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7478452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7478654Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7478757Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7478853Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7478948Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7478951Z 2025-12-04T13:44:25.7479184Z [rank1]:[W1204 13:25:16.638800527 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7479358Z [rank2]:[W1204 13:25:17.029115537 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7479533Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7479789Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7479951Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7480318Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7480522Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7480627Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7480723Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7480819Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7480821Z 2025-12-04T13:44:25.7481054Z [rank2]:[W1204 13:25:17.032221149 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7481240Z [rank3]:[W1204 13:25:17.030758571 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7481416Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7481674Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7481855Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7482221Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7482446Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7482551Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7482646Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7482741Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7482743Z 2025-12-04T13:44:25.7482975Z [rank3]:[W1204 13:25:17.032869344 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7483146Z [rank1]:[W1204 13:25:17.638956163 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7483321Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7483574Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7483737Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7484106Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7484311Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7484415Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7484510Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7484606Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7484609Z 2025-12-04T13:44:25.7484843Z [rank1]:[W1204 13:25:17.640201345 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7485015Z [rank2]:[W1204 13:25:18.032369554 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7485200Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7485458Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7485619Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7486000Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7486233Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7486337Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7486433Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7486529Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7486531Z 2025-12-04T13:44:25.7486765Z [rank2]:[W1204 13:25:18.034801680 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7486933Z [rank3]:[W1204 13:25:18.033048189 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7487111Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7487368Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7487571Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7487937Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7488139Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7488246Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7488341Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7488437Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7488439Z 2025-12-04T13:44:25.7488671Z [rank3]:[W1204 13:25:18.035164932 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7488841Z [rank1]:[W1204 13:25:18.640339311 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7489017Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7489285Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7489447Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7489830Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7490031Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7490161Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7490258Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7490354Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7490356Z 2025-12-04T13:44:25.7490588Z [rank1]:[W1204 13:25:18.641753620 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7490759Z [rank2]:[W1204 13:25:19.034934487 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7490933Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7491189Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7491351Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7491717Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7491918Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7492024Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7492120Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7492215Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7492216Z 2025-12-04T13:44:25.7492449Z [rank2]:[W1204 13:25:19.036927093 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7492619Z [rank3]:[W1204 13:25:19.035318859 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7492795Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7493062Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7493224Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7493590Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7493798Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7493923Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7494018Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7494114Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7494116Z 2025-12-04T13:44:25.7494348Z [rank3]:[W1204 13:25:19.037220136 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7494517Z [rank1]:[W1204 13:25:19.641852598 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7494692Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7494951Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7495114Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7495478Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7495680Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7495785Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7495883Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7495980Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7495982Z 2025-12-04T13:44:25.7496214Z [rank1]:[W1204 13:25:19.643159089 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7496383Z [rank2]:[W1204 13:25:20.037017001 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7496557Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7496813Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7496988Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7497357Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7497603Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7497708Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7497828Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7497923Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7497925Z 2025-12-04T13:44:25.7498158Z [rank2]:[W1204 13:25:20.039286081 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7498327Z [rank3]:[W1204 13:25:20.037396953 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7498502Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7498756Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7498920Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7499294Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7499493Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7499599Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7499693Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7499790Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7499792Z 2025-12-04T13:44:25.7500025Z [rank3]:[W1204 13:25:20.039395588 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7500195Z [rank1]:[W1204 13:25:20.643332205 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7500373Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7500627Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7500792Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7501169Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7501371Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7501486Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7501581Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7501702Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7501704Z 2025-12-04T13:44:25.7501935Z [rank1]:[W1204 13:25:20.645439389 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7502106Z [rank2]:[W1204 13:25:21.039447888 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7502280Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7502536Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7502698Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7503069Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7503271Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7503373Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7503471Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7503567Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7503571Z 2025-12-04T13:44:25.7503806Z [rank2]:[W1204 13:25:21.041026853 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7503975Z [rank3]:[W1204 13:25:21.039572565 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7504150Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7504405Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7504566Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7504947Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7505147Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7505251Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7505357Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7505453Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7505465Z 2025-12-04T13:44:25.7505710Z [rank3]:[W1204 13:25:21.042073750 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7505881Z [rank1]:[W1204 13:25:21.645597836 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7506055Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7506307Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7506470Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7506837Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7507038Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7507141Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7507236Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7507332Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7507334Z 2025-12-04T13:44:25.7507600Z [rank1]:[W1204 13:25:21.647566373 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7507772Z [rank2]:[W1204 13:25:22.041177251 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7507946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7508202Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7508364Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7508744Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7508948Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7509051Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7509146Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7509241Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7509256Z 2025-12-04T13:44:25.7509489Z [rank2]:[W1204 13:25:22.043408021 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7509684Z [rank3]:[W1204 13:25:22.042249007 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7509859Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7510120Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7510284Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7510652Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7510856Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7510959Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7511054Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7511150Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7511152Z 2025-12-04T13:44:25.7511386Z [rank3]:[W1204 13:25:22.043713315 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7511555Z [rank1]:[W1204 13:25:22.647743380 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7511730Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7511983Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7512146Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7512514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7512726Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7512831Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7512926Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7513022Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7513024Z 2025-12-04T13:44:25.7513268Z [rank1]:[W1204 13:25:22.649401414 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7513438Z [rank3]:[W1204 13:25:23.043877863 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7513632Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7513889Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7514052Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7514425Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7514631Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7514734Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7514829Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7514925Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7514927Z 2025-12-04T13:44:25.7515100Z [rank2]:[W1204 13:25:23.043981121 TCPStore.cpp:106] [c10d] 
sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7515274Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7515530Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7515694Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7516060Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7516264Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7516368Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7516465Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7516570Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7516572Z 2025-12-04T13:44:25.7516808Z [rank3]:[W1204 13:25:23.045644954 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7517038Z [rank2]:[W1204 13:25:23.045652744 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7517221Z [rank1]:[W1204 13:25:23.649688630 TCPStore.cpp:106] [c10d] sendBytes 
failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7517419Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7517722Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7517884Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7518251Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7518453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7518559Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7518655Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7518751Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7518753Z 2025-12-04T13:44:25.7518987Z [rank1]:[W1204 13:25:23.651426141 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7519158Z [rank3]:[W1204 13:25:24.045765143 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7519333Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7519591Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7519753Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7520121Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7520322Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7520427Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7520537Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7520633Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7520635Z 2025-12-04T13:44:25.7520867Z [rank3]:[W1204 13:25:24.047716780 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7521048Z [rank2]:[W1204 13:25:24.045793013 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7521224Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7521503Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7521665Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7522032Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7522233Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7522340Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7522435Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7522533Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7522535Z 2025-12-04T13:44:25.7522769Z [rank2]:[W1204 13:25:24.047835968 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7522938Z [rank1]:[W1204 13:25:24.651606480 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7523116Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7523375Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7523541Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7523906Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7524107Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7524211Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7524307Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7524415Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7524417Z 2025-12-04T13:44:25.7524648Z [rank1]:[W1204 13:25:24.653345121 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7524819Z [rank3]:[W1204 13:25:25.047875630 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7525002Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7525259Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7525444Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7525810Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7526012Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7526116Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7526213Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7526307Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7526309Z 2025-12-04T13:44:25.7526543Z [rank3]:[W1204 13:25:25.050139929 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7526712Z [rank2]:[W1204 13:25:25.048517095 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7526887Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7527144Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7527309Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7527721Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7527921Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7528027Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7528122Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7528221Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7528223Z 2025-12-04T13:44:25.7528471Z [rank2]:[W1204 13:25:25.050392044 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7528640Z [rank1]:[W1204 13:25:25.653503171 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7528815Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7529085Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7529270Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7529637Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7529840Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7529946Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7530040Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7530137Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7530140Z 2025-12-04T13:44:25.7530373Z [rank1]:[W1204 13:25:25.654917730 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7530544Z [rank3]:[W1204 13:25:26.050315799 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7530717Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7530974Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7531137Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7531504Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7531706Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7531809Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7531907Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7532003Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7532006Z 2025-12-04T13:44:25.7532255Z [rank3]:[W1204 13:25:26.052495351 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7532425Z [rank2]:[W1204 13:25:26.050531354 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7532601Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7532870Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7533032Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7533417Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7533617Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7533721Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7533817Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7533914Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7533916Z 2025-12-04T13:44:25.7534158Z [rank2]:[W1204 13:25:26.052875152 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7534327Z [rank1]:[W1204 13:25:26.655063850 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7534501Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7534754Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7534919Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7535285Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7535485Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7535589Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7535684Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7535782Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7535783Z 2025-12-04T13:44:25.7536014Z [rank1]:[W1204 13:25:26.656328272 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7536199Z [rank3]:[W1204 13:25:27.052628082 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7536375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7536632Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7536805Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7537170Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7537393Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7537540Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7537636Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7537731Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7537733Z 2025-12-04T13:44:25.7537966Z [rank3]:[W1204 13:25:27.054793474 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7538136Z [rank2]:[W1204 13:25:27.052988744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7538311Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7538570Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7538734Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7539100Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7539303Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7539409Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7539504Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7539601Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7539604Z 2025-12-04T13:44:25.7539841Z [rank2]:[W1204 13:25:27.054938371 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7540014Z [rank1]:[W1204 13:25:27.656456794 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7540204Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7540457Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7540621Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7541001Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7541226Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7541330Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7541424Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7541520Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7541522Z 2025-12-04T13:44:25.7541754Z [rank1]:[W1204 13:25:27.657964480 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7541924Z [rank3]:[W1204 13:25:28.054958355 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7542099Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7542355Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7542518Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7542887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7543090Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7543194Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7543291Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7543386Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7543387Z 2025-12-04T13:44:25.7543621Z [rank3]:[W1204 13:25:28.056991720 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7543792Z [rank2]:[W1204 13:25:28.055080612 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7543966Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7544233Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7544395Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7544770Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7544994Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7545099Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7545197Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7545294Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7545296Z 2025-12-04T13:44:25.7545530Z [rank2]:[W1204 13:25:28.057403851 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7545699Z [rank1]:[W1204 13:25:28.658106502 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7545876Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7546131Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7546294Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7546663Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7546864Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7546970Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7547065Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7547162Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7547164Z 2025-12-04T13:44:25.7547399Z [rank1]:[W1204 13:25:28.659605299 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7547618Z [rank3]:[W1204 13:25:29.057163161 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7547794Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7548069Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7548233Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7548611Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7548812Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7548939Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7549036Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7549131Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7549133Z 2025-12-04T13:44:25.7549365Z [rank3]:[W1204 13:25:29.059495710 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7549537Z [rank2]:[W1204 13:25:29.057539333 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7549711Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7549969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7550131Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7550499Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7550700Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7550806Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7550903Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7551000Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7551001Z 2025-12-04T13:44:25.7551235Z [rank2]:[W1204 13:25:29.060069617 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7551403Z [rank1]:[W1204 13:25:29.659743142 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7551579Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7551836Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7552009Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7552375Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7552587Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7552691Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7552804Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7552900Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7552902Z 2025-12-04T13:44:25.7553132Z [rank1]:[W1204 13:25:29.660981884 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7553302Z [rank3]:[W1204 13:25:30.059637492 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7553478Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7553736Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7553904Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7554268Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7554471Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7554573Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7554671Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7554766Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7554768Z 2025-12-04T13:44:25.7555002Z [rank3]:[W1204 13:25:30.061181018 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7555171Z [rank2]:[W1204 13:25:30.060241549 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7555345Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7555601Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7555764Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7556151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7556351Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7556474Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7556570Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7556685Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7556687Z 2025-12-04T13:44:25.7556920Z [rank2]:[W1204 13:25:30.062349042 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7557089Z [rank1]:[W1204 13:25:30.661091548 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7557263Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7557557Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7557722Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7558088Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7558289Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7558394Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7558489Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7558585Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7558588Z 2025-12-04T13:44:25.7558821Z [rank1]:[W1204 13:25:30.662354160 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7558991Z [rank3]:[W1204 13:25:31.061309912 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7559164Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7559422Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7559585Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7559964Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7560166Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7560270Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7560381Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7560478Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7560513Z 2025-12-04T13:44:25.7560746Z [rank3]:[W1204 13:25:31.063763857 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7560916Z [rank2]:[W1204 13:25:31.062499165 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7561089Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7561344Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7561507Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7561877Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7562078Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7562182Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7562279Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7562376Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7562378Z 2025-12-04T13:44:25.7562616Z [rank2]:[W1204 13:25:31.064714776 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7562824Z [rank2]:W1204 13:25:31.655000 68235 
site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.7562995Z [rank1]:[W1204 13:25:31.662497384 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7563168Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7563423Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7563587Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7563961Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7564163Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7564276Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7564372Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7564467Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7564488Z 2025-12-04T13:44:25.7564725Z [rank1]:[W1204 13:25:31.663749756 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7564933Z [rank3]:W1204 13:25:32.113000 68236 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 
2025-12-04T13:44:25.7565101Z [rank3]:[W1204 13:25:32.063956840 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7565276Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7565530Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7565696Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7566060Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7566262Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7566368Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7566464Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7566560Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7566562Z 2025-12-04T13:44:25.7566795Z [rank3]:[W1204 13:25:32.066151171 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7566967Z [rank2]:[W1204 13:25:32.064878280 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7567141Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7567397Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7567591Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7567976Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7568178Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7568294Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7568390Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7568510Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7568512Z 2025-12-04T13:44:25.7568745Z [rank2]:[W1204 13:25:32.066314038 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7568914Z [rank1]:[W1204 13:25:32.664157144 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7569089Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7569348Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7569511Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7569876Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7570076Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7570181Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7570276Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7570372Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7570376Z 2025-12-04T13:44:25.7570610Z [rank1]:[W1204 13:25:32.665673090 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7570780Z [rank2]:[W1204 13:25:33.066483301 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7570954Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7571209Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7571373Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7571752Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7571955Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7572062Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7572167Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7572265Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7572287Z 2025-12-04T13:44:25.7572519Z [rank2]:[W1204 13:25:33.067899270 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7572692Z [rank3]:[W1204 13:25:33.066327625 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7572865Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7573120Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7573281Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7573651Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7573853Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7573956Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7574051Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7574146Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7574148Z 2025-12-04T13:44:25.7574381Z [rank3]:[W1204 13:25:33.068548986 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7574552Z [rank1]:[W1204 13:25:33.665808615 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7574726Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7574980Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7575142Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7575516Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7575720Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7575825Z frame #4: + 0xdc253 (0x7e1038e48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7575921Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7576035Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7576037Z 2025-12-04T13:44:25.7576270Z [rank1]:[W1204 13:25:33.667062848 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7576460Z [rank2]:[W1204 13:25:34.068076144 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7576635Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7576890Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7577053Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7577421Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7577645Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7577750Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7577845Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7577944Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.7577945Z 2025-12-04T13:44:25.7578178Z [rank2]:[W1204 13:25:34.069719378 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7578352Z [rank3]:[W1204 13:25:34.068713450 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7578525Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7578781Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7578943Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7579314Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7579538Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7579642Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7579738Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7579832Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7579834Z 2025-12-04T13:44:25.7580083Z [rank3]:[W1204 13:25:34.071140057 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:25.7580276Z [rank1]:[W1204 13:25:34.667217083 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7580453Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7580709Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7580870Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7581239Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7581444Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7581548Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7581642Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7581739Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7581740Z 2025-12-04T13:44:25.7581974Z [rank1]:[W1204 13:25:34.668544673 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7582144Z [rank2]:[W1204 13:25:35.069863583 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7582321Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7582576Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7582739Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7583106Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7583309Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7583426Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7583521Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7583618Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7583620Z 2025-12-04T13:44:25.7583861Z [rank2]:[W1204 13:25:35.071171024 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7584032Z [rank3]:[W1204 13:25:35.071310941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7584225Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7584483Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7584646Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7585011Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7585214Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7585318Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7585413Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7585508Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7585510Z 2025-12-04T13:44:25.7585744Z [rank3]:[W1204 13:25:35.073582681 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7585915Z [rank1]:[W1204 13:25:35.668720508 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7586092Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7586352Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7586514Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.7586888Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7587088Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7587194Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7587298Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7587395Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7587397Z 2025-12-04T13:44:25.7587669Z [rank1]:[W1204 13:25:35.670729314 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7587853Z [rank2]:[W1204 13:25:36.071327700 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7588028Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7588310Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7588473Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7588844Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.7589048Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7589155Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7589251Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7589348Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7589350Z 2025-12-04T13:44:25.7589583Z [rank2]:[W1204 13:25:36.072784138 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7589756Z [rank3]:[W1204 13:25:36.073743327 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7589930Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7590188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7590353Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7590717Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7590919Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7591023Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7591122Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7591235Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7591237Z 2025-12-04T13:44:25.7591468Z [rank3]:[W1204 13:25:36.075941468 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7591636Z [rank1]:[W1204 13:25:36.670883660 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7591820Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7592076Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7596557Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7596934Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7597139Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7597246Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7597345Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7597444Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7597447Z 2025-12-04T13:44:25.7597722Z [rank1]:[W1204 13:25:36.672178021 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7597895Z [rank2]:[W1204 13:25:37.072940924 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7598074Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7598329Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7598497Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7598865Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7599070Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7599176Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7599271Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7599369Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7599372Z 2025-12-04T13:44:25.7599637Z [rank2]:[W1204 13:25:37.074472150 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7599810Z [rank3]:[W1204 13:25:37.076104954 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7599985Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7600256Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7600447Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7600814Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7601016Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7601121Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7601217Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7601314Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7601316Z 2025-12-04T13:44:25.7601551Z [rank3]:[W1204 13:25:37.078187308 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7601720Z [rank1]:[W1204 13:25:37.672313328 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.7601896Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7602156Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7602321Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7602686Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7602887Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7602992Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7603087Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7603186Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7603189Z 2025-12-04T13:44:25.7603437Z [rank1]:[W1204 13:25:37.673902493 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7603607Z [rank2]:[W1204 13:25:38.074644576 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7603783Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7604046Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7604212Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7604598Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7604800Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7604904Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7605001Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7605098Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7605102Z 2025-12-04T13:44:25.7605334Z [rank2]:[W1204 13:25:38.076552624 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7605505Z [rank3]:[W1204 13:25:38.078400623 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7605679Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7605935Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7606099Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7606468Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7606670Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7606773Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7606868Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7606965Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7606967Z 2025-12-04T13:44:25.7607200Z [rank3]:[W1204 13:25:38.080480347 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7607382Z [rank1]:[W1204 13:25:38.674057930 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7607600Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7607854Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7608028Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7608395Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7608629Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7608734Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7608829Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7608927Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7608929Z 2025-12-04T13:44:25.7609162Z [rank1]:[W1204 13:25:38.675329082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7609334Z [rank2]:[W1204 13:25:39.076720161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7609509Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7609761Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7609926Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7610291Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7610495Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7610599Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7610694Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7610791Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7610793Z 2025-12-04T13:44:25.7611027Z [rank2]:[W1204 13:25:39.078232948 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7611198Z [rank3]:[W1204 13:25:39.080916758 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7611384Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7611642Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7611804Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7612179Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7612400Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7612503Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7612600Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7612694Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7612696Z 2025-12-04T13:44:25.7612933Z [rank3]:[W1204 13:25:39.083033271 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7613107Z [rank1]:[W1204 13:25:39.675476710 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7613283Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7613538Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7613699Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7614067Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7614270Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7614375Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7614470Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7614565Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7614567Z 2025-12-04T13:44:25.7614804Z [rank1]:[W1204 13:25:39.676733242 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7614974Z [rank2]:[W1204 13:25:40.078394595 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7615152Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7615416Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7615581Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7615961Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7616182Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7616288Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7616382Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7616479Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7616480Z 2025-12-04T13:44:25.7616713Z [rank2]:[W1204 13:25:40.080016939 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7616883Z [rank3]:[W1204 13:25:40.083221488 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7617058Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7617318Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7617520Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7617887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7618090Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7618195Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7618290Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7618385Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7618387Z 2025-12-04T13:44:25.7618621Z [rank3]:[W1204 13:25:40.085460759 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7618792Z [rank1]:[W1204 13:25:40.676877400 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7618966Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7619233Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7619394Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7619779Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7619980Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7620109Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7620204Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7620299Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7620301Z 2025-12-04T13:44:25.7620534Z [rank1]:[W1204 13:25:40.678154942 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7620704Z [rank2]:[W1204 13:25:41.080175698 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7620879Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7621135Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7621300Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7621673Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7621877Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7621983Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7622077Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7622176Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7622177Z 2025-12-04T13:44:25.7622409Z [rank2]:[W1204 13:25:41.082215213 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7622581Z [rank3]:[W1204 13:25:41.085638447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7622755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7623013Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7623185Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7623552Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7623763Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7623868Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7623991Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7624088Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7624089Z 2025-12-04T13:44:25.7624322Z [rank3]:[W1204 13:25:41.087676482 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7624491Z [rank1]:[W1204 13:25:41.678305371 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7624665Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7624921Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7625085Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7625454Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7625658Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7625763Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7625860Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7625955Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7625958Z 2025-12-04T13:44:25.7626193Z [rank1]:[W1204 13:25:41.679809617 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7626362Z [rank2]:[W1204 13:25:42.082364612 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7626537Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7626790Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7626964Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7627331Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7627572Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7627694Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7627789Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7627912Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7627914Z 2025-12-04T13:44:25.7628149Z [rank2]:[W1204 13:25:42.083611954 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7628320Z [rank3]:[W1204 13:25:42.087876840 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7628493Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7628750Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7628913Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7629279Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7629481Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7629584Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7629679Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7629776Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7629778Z 2025-12-04T13:44:25.7630013Z [rank3]:[W1204 13:25:42.090211748 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7630183Z [rank1]:[W1204 13:25:42.680009726 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7630357Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7630614Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7630777Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7631157Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7631358Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7631463Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7631570Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7631665Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7631686Z 2025-12-04T13:44:25.7631920Z [rank1]:[W1204 13:25:42.682005311 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7632089Z [rank2]:[W1204 13:25:43.083765194 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7632264Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7632519Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7632683Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7633054Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7633255Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7633359Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7633454Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7633551Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7633554Z 2025-12-04T13:44:25.7633787Z [rank2]:[W1204 13:25:43.084972437 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7633957Z [rank3]:[W1204 13:25:43.090691360 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7634132Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7634389Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7634552Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7634938Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7635140Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7635243Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7635339Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7635443Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7635446Z 2025-12-04T13:44:25.7635678Z [rank3]:[W1204 13:25:43.093035498 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7635869Z [rank1]:[W1204 13:25:43.682167071 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7636043Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7636299Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7636463Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7636832Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7637036Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7637139Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7637235Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7637332Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7637334Z 2025-12-04T13:44:25.7637600Z [rank1]:[W1204 13:25:43.683411124 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7637771Z [rank2]:[W1204 13:25:44.085100428 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7637946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7638203Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7638368Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7638738Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7638953Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7639059Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7639153Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7639250Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7639252Z 2025-12-04T13:44:25.7639503Z [rank2]:[W1204 13:25:44.086759861 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7639698Z [rank3]:[W1204 13:25:44.093174629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7639873Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7640127Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7640289Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7640655Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7640858Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7640963Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7641058Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7641153Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7641156Z 2025-12-04T13:44:25.7641392Z [rank3]:[W1204 13:25:44.095438839 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7641566Z [rank1]:[W1204 13:25:44.683561434 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7641740Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7641995Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7642157Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7642523Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7642738Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7642841Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7642936Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7643031Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7643033Z 2025-12-04T13:44:25.7643278Z [rank1]:[W1204 13:25:44.684833436 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7643449Z [rank2]:[W1204 13:25:45.087342442 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7643646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7643902Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7644064Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7644434Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7644637Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7644743Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7644837Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7644933Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7644935Z 2025-12-04T13:44:25.7645169Z [rank2]:[W1204 13:25:45.088797260 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7645338Z [rank3]:[W1204 13:25:45.095601349 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7645514Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7645772Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7645934Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7646300Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7646501Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7646616Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7646713Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7646808Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7646811Z 2025-12-04T13:44:25.7647043Z [rank3]:[W1204 13:25:45.097767861 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7647223Z [rank1]:[W1204 13:25:45.684991707 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7647396Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7647708Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7647870Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7648242Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7648444Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7648549Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7648645Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7648740Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7648742Z 2025-12-04T13:44:25.7648976Z [rank1]:[W1204 13:25:45.686257469 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7649145Z [rank2]:[W1204 13:25:46.088954451 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7649321Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7649577Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7649740Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7650111Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7650314Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7650420Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7650514Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7650626Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7650628Z 2025-12-04T13:44:25.7650862Z [rank2]:[W1204 13:25:46.090521696 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7651032Z [rank3]:[W1204 13:25:46.097949361 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7651218Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7651500Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7651663Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7652029Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7652232Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7652336Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7652433Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7652529Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7652532Z 2025-12-04T13:44:25.7652766Z [rank3]:[W1204 13:25:46.099730082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7652936Z [rank1]:[W1204 13:25:46.686419660 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7653110Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7653365Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7653530Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7653897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7654099Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7654202Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7654299Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7654395Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7654396Z 2025-12-04T13:44:25.7654651Z [rank1]:[W1204 13:25:46.687991285 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7654821Z [rank2]:[W1204 13:25:47.090631399 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7655005Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7655259Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7655440Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7655808Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7656008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7656113Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7656208Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7656306Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7656308Z 2025-12-04T13:44:25.7656541Z [rank2]:[W1204 13:25:47.091779483 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7656714Z [rank3]:[W1204 13:25:47.099907003 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7656888Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7657143Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7657307Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7657711Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7657912Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7658016Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7658112Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7658210Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7658212Z 2025-12-04T13:44:25.7658458Z [rank3]:[W1204 13:25:47.101987647 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7658630Z [rank1]:[W1204 13:25:47.688120408 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7658803Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7659070Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7659232Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7659624Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7659826Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7659930Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7660026Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7660122Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7660126Z 2025-12-04T13:44:25.7660360Z [rank1]:[W1204 13:25:47.689877109 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7660529Z [rank2]:[W1204 13:25:48.091933426 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7660705Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7660962Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7661128Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7661497Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7661699Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7661803Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7661898Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7661994Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7661997Z 2025-12-04T13:44:25.7662231Z [rank2]:[W1204 13:25:48.093491221 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7662409Z [rank3]:[W1204 13:25:48.102266917 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7662585Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7662842Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7663017Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7663404Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7663605Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7663707Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7663803Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7663900Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7663902Z 2025-12-04T13:44:25.7664134Z [rank3]:[W1204 13:25:48.103914470 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7664307Z [rank1]:[W1204 13:25:48.690060311 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7664482Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7664738Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7664901Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7665269Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7665475Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7665578Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7665674Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7665770Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7665771Z 2025-12-04T13:44:25.7666005Z [rank1]:[W1204 13:25:48.692030027 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7666185Z [rank2]:[W1204 13:25:49.093622124 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7666359Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7666615Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7666785Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7667155Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7667381Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7667518Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7667613Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7667710Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7667712Z 2025-12-04T13:44:25.7667946Z [rank2]:[W1204 13:25:49.095688079 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7668118Z [rank3]:[W1204 13:25:49.104088813 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7668294Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7668548Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7668711Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7669077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7669281Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7669385Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7669479Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7669575Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7669579Z 2025-12-04T13:44:25.7669815Z [rank3]:[W1204 13:25:49.105979741 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7669987Z [rank1]:[W1204 13:25:49.692193940 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7670177Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7670433Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7670595Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7670974Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7671198Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7671301Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7671397Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7671491Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7671493Z 2025-12-04T13:44:25.7671727Z [rank1]:[W1204 13:25:49.693860983 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7671897Z [rank2]:[W1204 13:25:50.095851632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7672075Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7672333Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7672497Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7672865Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7673068Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7673174Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7673268Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7673366Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7673368Z 2025-12-04T13:44:25.7673600Z [rank2]:[W1204 13:25:50.097831908 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7673769Z [rank3]:[W1204 13:25:50.106151764 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7673946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7674213Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7674377Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7674757Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7674959Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7675081Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7675177Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7675274Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7675276Z 2025-12-04T13:44:25.7675508Z [rank3]:[W1204 13:25:50.108193598 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7675678Z [rank1]:[W1204 13:25:50.694024536 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7675853Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7676113Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7676277Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7676650Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7676853Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7676959Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7677055Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7677150Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7677151Z 2025-12-04T13:44:25.7677384Z [rank1]:[W1204 13:25:50.696265137 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7677585Z [rank2]:[W1204 13:25:51.097987232 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7677761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7678031Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7678195Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7678567Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7678784Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7678913Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7679008Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7679106Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7679107Z 2025-12-04T13:44:25.7679341Z [rank2]:[W1204 13:25:51.099348151 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7679510Z [rank3]:[W1204 13:25:51.108382522 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7679685Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7679941Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7680105Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7680472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7680675Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7680780Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7680878Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7680975Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7680976Z 2025-12-04T13:44:25.7681209Z [rank3]:[W1204 13:25:51.110278879 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7681378Z [rank1]:[W1204 13:25:51.696434451 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7681551Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7681806Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7681979Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7682346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7682558Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7682661Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7682784Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7682880Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7682881Z 2025-12-04T13:44:25.7683118Z [rank1]:[W1204 13:25:51.698695821 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7683286Z [rank2]:[W1204 13:25:52.099496066 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7683462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7683718Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7683881Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7684249Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7684449Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7684554Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7684649Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7684747Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7684749Z 2025-12-04T13:44:25.7684985Z [rank2]:[W1204 13:25:52.101170899 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7685155Z [rank3]:[W1204 13:25:52.110404514 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7685331Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7685590Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7685755Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7686130Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7686333Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7686448Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7686542Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7686664Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7686665Z 2025-12-04T13:44:25.7686897Z [rank3]:[W1204 13:25:52.112491378 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7687067Z [rank1]:[W1204 13:25:52.698869645 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7687241Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7687530Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7687691Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7688060Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7688261Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7688365Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7688461Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7688555Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7688559Z 2025-12-04T13:44:25.7688795Z [rank1]:[W1204 13:25:52.701027787 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7688965Z [rank2]:[W1204 13:25:53.101282605 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7689140Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7689400Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7689564Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7689952Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7690153Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7690258Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7690365Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7690464Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7690466Z 2025-12-04T13:44:25.7690725Z [rank2]:[W1204 13:25:53.102848230 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7690894Z [rank3]:[W1204 13:25:53.112639383 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7691067Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7691322Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7691485Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7691859Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7692062Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7692166Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7692260Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7692357Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7692359Z 2025-12-04T13:44:25.7692591Z [rank3]:[W1204 13:25:53.114387915 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7692763Z [rank1]:[W1204 13:25:53.701204902 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7692936Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7693193Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7693357Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7693735Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7693940Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7694043Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7694138Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7694233Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7694244Z 2025-12-04T13:44:25.7694478Z [rank1]:[W1204 13:25:53.703261956 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7694669Z [rank2]:[W1204 13:25:54.102960586 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7694844Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7695098Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7695261Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7695626Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7695830Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7695935Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7696030Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7696127Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7696129Z 2025-12-04T13:44:25.7696363Z [rank2]:[W1204 13:25:54.104606920 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7696534Z [rank3]:[W1204 13:25:54.114558040 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7696709Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7696965Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7697127Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7697534Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7697752Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7697856Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7697950Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7698049Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7698051Z 2025-12-04T13:44:25.7698315Z [rank3]:[W1204 13:25:54.116617454 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7698487Z [rank1]:[W1204 13:25:54.703425642 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7698687Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7698941Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7699103Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7699470Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7699673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7699776Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7699871Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7699966Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7699968Z 2025-12-04T13:44:25.7700203Z [rank1]:[W1204 13:25:54.705456697 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7700376Z [rank2]:[W1204 13:25:55.104765896 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7700552Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7700807Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7700968Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7701336Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7701537Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7701655Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7701750Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7701846Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7701848Z 2025-12-04T13:44:25.7702081Z [rank2]:[W1204 13:25:55.106821800 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7702260Z [rank3]:[W1204 13:25:55.116788530 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7702453Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7702713Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7702876Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7703244Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7703445Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7703551Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7703646Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7703742Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7703744Z 2025-12-04T13:44:25.7703975Z [rank3]:[W1204 13:25:55.118593430 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7704148Z [rank1]:[W1204 13:25:55.705872457 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7704323Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7704582Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7704749Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7705118Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7705320Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7705424Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7705529Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7705624Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7705626Z 2025-12-04T13:44:25.7705859Z [rank1]:[W1204 13:25:55.708281314 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7706039Z [rank2]:[W1204 13:25:56.106950447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7706213Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7706491Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7706654Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7707027Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7707229Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7707336Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7707432Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7707570Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7707571Z 2025-12-04T13:44:25.7707805Z [rank2]:[W1204 13:25:56.109216917 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7707973Z [rank3]:[W1204 13:25:56.118770226 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7708148Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7708403Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7708568Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7708933Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7709135Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7709239Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7709335Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7709443Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7709445Z 2025-12-04T13:44:25.7709678Z [rank3]:[W1204 13:25:56.120272442 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7709848Z [rank1]:[W1204 13:25:56.708443431 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7710033Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7710289Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7710476Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7710840Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7711042Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7711146Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7711243Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7711339Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7711341Z 2025-12-04T13:44:25.7711574Z [rank1]:[W1204 13:25:56.710648142 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7711744Z [rank2]:[W1204 13:25:57.109377104 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7711917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7712175Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7712338Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7712706Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7712906Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7713012Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7713107Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7713204Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7713206Z 2025-12-04T13:44:25.7713457Z [rank2]:[W1204 13:25:57.110602137 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7713628Z [rank3]:[W1204 13:25:57.120731073 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7713802Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7714071Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7714254Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7714621Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7714821Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7714926Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7715020Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7715117Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7715121Z 2025-12-04T13:44:25.7715352Z [rank3]:[W1204 13:25:57.123069771 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7715524Z [rank1]:[W1204 13:25:57.710856658 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7715699Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7715956Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7716121Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7716489Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7716691Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7716795Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7716893Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7716989Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7716993Z 2025-12-04T13:44:25.7717236Z [rank1]:[W1204 13:25:57.712126090 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7717405Z [rank2]:[W1204 13:25:58.110756205 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7717610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7717880Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7718046Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7718444Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7718646Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7718751Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7718847Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7718944Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7718946Z 2025-12-04T13:44:25.7719180Z [rank2]:[W1204 13:25:58.112131444 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7719349Z [rank3]:[W1204 13:25:58.123272767 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7719523Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7719778Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7719941Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7720312Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7720513Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7720616Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7720710Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7720809Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7720810Z 2025-12-04T13:44:25.7721043Z [rank3]:[W1204 13:25:58.124978010 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7721227Z [rank1]:[W1204 13:25:58.712309177 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7721402Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7721657Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7721831Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7722202Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7722425Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7722528Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7722624Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7722719Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7722722Z 2025-12-04T13:44:25.7722954Z [rank1]:[W1204 13:25:58.713949921 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7723125Z [rank2]:[W1204 13:25:59.112452468 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7723299Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7723555Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7723718Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7724084Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7724287Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7724391Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7724488Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7724584Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7724586Z 2025-12-04T13:44:25.7724820Z [rank2]:[W1204 13:25:59.114158951 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7724990Z [rank3]:[W1204 13:25:59.125123698 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7725177Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7725433Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7725596Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7725972Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7726197Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7726301Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7726396Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7726492Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7726494Z 2025-12-04T13:44:25.7726728Z [rank3]:[W1204 13:25:59.126319981 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7726900Z [rank1]:[W1204 13:25:59.714124609 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7727076Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7727334Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7727531Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7727901Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7728106Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7728210Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7728307Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7728402Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7728405Z 2025-12-04T13:44:25.7728637Z [rank1]:[W1204 13:25:59.715611026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7728809Z [rank2]:[W1204 13:26:00.114323159 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7728985Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7729254Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7729417Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7729795Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7730020Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7730125Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7730221Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7730317Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7730319Z 2025-12-04T13:44:25.7730553Z [rank2]:[W1204 13:26:00.115581511 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7730722Z [rank3]:[W1204 13:26:00.126490610 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7730899Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7731159Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7731322Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7731689Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7731889Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7731995Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7732090Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7732185Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7732187Z 2025-12-04T13:44:25.7732418Z [rank3]:[W1204 13:26:00.127719212 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7732589Z [rank1]:[W1204 13:26:00.715767125 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7732764Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7733029Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7733195Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7733571Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7733772Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7733898Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7733996Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7734091Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7734094Z 2025-12-04T13:44:25.7734325Z [rank1]:[W1204 13:26:00.717102855 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7734497Z [rank2]:[W1204 13:26:01.115757310 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7734669Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7734927Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7735090Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7735462Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7735666Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7735770Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7735866Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7735963Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7735964Z 2025-12-04T13:44:25.7736198Z [rank2]:[W1204 13:26:01.117249927 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7736368Z [rank3]:[W1204 13:26:01.127914671 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7736545Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7736801Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7736975Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7737340Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7737587Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7737691Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7737809Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7737906Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7737908Z 2025-12-04T13:44:25.7738140Z [rank3]:[W1204 13:26:01.129891707 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7738309Z [rank1]:[W1204 13:26:01.717277564 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7738485Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7738740Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7738904Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7739269Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7739474Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7739578Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7739677Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7739774Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7739777Z 2025-12-04T13:44:25.7740016Z [rank1]:[W1204 13:26:01.719357538 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7740185Z [rank2]:[W1204 13:26:02.117359777 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7740357Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7740612Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7740776Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7741158Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7741360Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7741473Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7741569Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7741693Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7741695Z 2025-12-04T13:44:25.7741930Z [rank2]:[W1204 13:26:02.119337934 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7742100Z [rank3]:[W1204 13:26:02.130070486 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7742274Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7742530Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7742694Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7743067Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7743267Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7743372Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7743466Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7743563Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7743566Z 2025-12-04T13:44:25.7743799Z [rank3]:[W1204 13:26:02.131367717 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7743968Z [rank1]:[W1204 13:26:02.719548127 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7744144Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7744401Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7744564Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7744942Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7745143Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7745246Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7745352Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7745449Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7745471Z 2025-12-04T13:44:25.7745703Z [rank1]:[W1204 13:26:02.721131412 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7745874Z [rank2]:[W1204 13:26:03.119500354 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7746047Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7746306Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7746470Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7746839Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7747040Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7747143Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7747239Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7747336Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7747337Z 2025-12-04T13:44:25.7747617Z [rank2]:[W1204 13:26:03.121278354 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7747788Z [rank3]:[W1204 13:26:03.131547927 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7747964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7748219Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7748383Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7748769Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7748971Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7749076Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7749170Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7749279Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7749281Z 2025-12-04T13:44:25.7749515Z [rank3]:[W1204 13:26:03.132797329 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7749717Z [rank1]:[W1204 13:26:03.721298402 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7749893Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7750147Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7750311Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7750680Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7750885Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7750989Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7751085Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7751181Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7751183Z 2025-12-04T13:44:25.7751417Z [rank1]:[W1204 13:26:03.723118662 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7751588Z [rank2]:[W1204 13:26:04.121450835 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7751762Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7752018Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7752181Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7752548Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7752763Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7752867Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7752964Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7753060Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7753062Z 2025-12-04T13:44:25.7753305Z [rank2]:[W1204 13:26:04.123039029 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7753493Z [rank3]:[W1204 13:26:04.132997339 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7753669Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7753923Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7754086Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7754452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7754655Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7754759Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7754854Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7754950Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7754952Z 2025-12-04T13:44:25.7755188Z [rank3]:[W1204 13:26:04.135044934 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7755359Z [rank1]:[W1204 13:26:04.723303862 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7755536Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7755789Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7755951Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7756317Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7756519Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7756632Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7756729Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7756824Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7756826Z 2025-12-04T13:44:25.7757074Z [rank1]:[W1204 13:26:04.725287408 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7757247Z [rank2]:[W1204 13:26:05.123148142 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7757440Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7757736Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7757898Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7758268Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7758473Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7758577Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7758673Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7758769Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7758770Z 2025-12-04T13:44:25.7759003Z [rank2]:[W1204 13:26:05.124363495 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7759173Z [rank3]:[W1204 13:26:05.135218644 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7759350Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7759609Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7759772Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7760141Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7760343Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7760450Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7760557Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7760654Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7760657Z 2025-12-04T13:44:25.7760889Z [rank3]:[W1204 13:26:05.136492246 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7761071Z [rank1]:[W1204 13:26:05.725468779 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7761247Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7761528Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7761690Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7762057Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7762259Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7762365Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7762460Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7762557Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7762558Z 2025-12-04T13:44:25.7762789Z [rank1]:[W1204 13:26:05.727882166 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7762960Z [rank2]:[W1204 13:26:06.124499767 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7763134Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7763389Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7763552Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7763922Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7764125Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7764230Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7764326Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7764432Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7764434Z 2025-12-04T13:44:25.7764666Z [rank2]:[W1204 13:26:06.126779756 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7764835Z [rank3]:[W1204 13:26:06.136658938 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7765019Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7765276Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7765457Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7765822Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7766025Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7766130Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7766226Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7766323Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7766325Z 2025-12-04T13:44:25.7766559Z [rank3]:[W1204 13:26:06.138356320 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7766728Z [rank1]:[W1204 13:26:06.728073437 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7766904Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7767162Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7767326Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7767735Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7767938Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7768043Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7768139Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7768238Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7768240Z 2025-12-04T13:44:25.7768484Z [rank1]:[W1204 13:26:06.729988985 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7768655Z [rank2]:[W1204 13:26:07.126922699 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7768828Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7769099Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7769284Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7769651Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7769853Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7769958Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7770054Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7770151Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7770153Z 2025-12-04T13:44:25.7770387Z [rank2]:[W1204 13:26:07.128900425 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7770557Z [rank3]:[W1204 13:26:07.138526442 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7770734Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7770990Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7771154Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7771521Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7771721Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7771826Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7771924Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7772020Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7772023Z 2025-12-04T13:44:25.7772273Z [rank3]:[W1204 13:26:07.139749865 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7772444Z [rank1]:[W1204 13:26:07.730178236 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7772619Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7772884Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7773047Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7773430Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7773630Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7773734Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7773831Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7773930Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7773934Z 2025-12-04T13:44:25.7774165Z [rank1]:[W1204 13:26:07.732265550 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7774336Z [rank2]:[W1204 13:26:08.129274823 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7774509Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7774767Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7774930Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7775299Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7775501Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7775604Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7775700Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7775796Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7775799Z 2025-12-04T13:44:25.7776032Z [rank2]:[W1204 13:26:08.131600462 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7776212Z [rank3]:[W1204 13:26:08.139917238 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7776388Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7776643Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7776815Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7777186Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7777405Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7777549Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7777644Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7777742Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7777743Z 2025-12-04T13:44:25.7777976Z [rank3]:[W1204 13:26:08.141158030 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7778149Z [rank1]:[W1204 13:26:08.732454523 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7778323Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7778581Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7778745Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7779115Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7779321Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7779426Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7779520Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7779618Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7779619Z 2025-12-04T13:44:25.7779851Z [rank1]:[W1204 13:26:08.734896419 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7780025Z [rank2]:[W1204 13:26:09.131802804 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7780213Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7780468Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7780629Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7781006Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7781232Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7781337Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7781433Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7781529Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7781531Z 2025-12-04T13:44:25.7781765Z [rank2]:[W1204 13:26:09.134192561 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7781936Z [rank3]:[W1204 13:26:09.141362092 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7782112Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7782368Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7782529Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7782897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7783100Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7783204Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7783298Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7783395Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7783397Z 2025-12-04T13:44:25.7783635Z [rank3]:[W1204 13:26:09.143390407 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7783805Z [rank1]:[W1204 13:26:09.735108101 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7783981Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7784246Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7784409Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7784784Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7785011Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7785116Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7785211Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7785307Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7785310Z 2025-12-04T13:44:25.7785545Z [rank1]:[W1204 13:26:09.737618195 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7785717Z [rank2]:[W1204 13:26:10.134320845 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7785893Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7786151Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7786314Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7786679Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7786881Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7786986Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7787082Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7787177Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7787179Z 2025-12-04T13:44:25.7787412Z [rank2]:[W1204 13:26:10.136642564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7787616Z [rank3]:[W1204 13:26:10.143589040 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7787793Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7788065Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7788227Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7788609Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7788810Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7788938Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7789034Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7789130Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7789132Z 2025-12-04T13:44:25.7789365Z [rank3]:[W1204 13:26:10.145801881 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7789534Z [rank1]:[W1204 13:26:10.737793529 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7789708Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7789965Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7790129Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7790495Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7790697Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7790803Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7790898Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7790995Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7790996Z 2025-12-04T13:44:25.7791228Z [rank1]:[W1204 13:26:10.740150587 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7791400Z [rank2]:[W1204 13:26:11.136809568 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7791574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7791832Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7792005Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7792377Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7792590Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7792704Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7792811Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7792908Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7792909Z 2025-12-04T13:44:25.7793144Z [rank2]:[W1204 13:26:11.139155796 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7793314Z [rank3]:[W1204 13:26:11.145971055 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7793489Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7793746Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7793910Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7794278Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7794480Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7794585Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7794681Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7794777Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7794780Z 2025-12-04T13:44:25.7795014Z [rank3]:[W1204 13:26:11.147378624 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7795183Z [rank1]:[W1204 13:26:11.740283882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7795357Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7795613Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7795785Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7796150Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7796350Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7796466Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7796561Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7796675Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7796677Z 2025-12-04T13:44:25.7796909Z [rank1]:[W1204 13:26:11.742674879 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7797079Z [rank2]:[W1204 13:26:12.139287301 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7797252Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7797550Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7797715Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7798081Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7798282Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7798386Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7798482Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7798580Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7798582Z 2025-12-04T13:44:25.7798820Z [rank2]:[W1204 13:26:12.141464593 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7798990Z [rank3]:[W1204 13:26:12.147489410 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7799163Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7799419Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7799581Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7799963Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7800163Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7800268Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7800376Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7800471Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7800502Z 2025-12-04T13:44:25.7800736Z [rank3]:[W1204 13:26:12.148639954 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7800905Z [rank1]:[W1204 13:26:12.742836114 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7801081Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7801336Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7801500Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7801866Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7802066Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7802171Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7802266Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7802362Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7802366Z 2025-12-04T13:44:25.7802598Z [rank1]:[W1204 13:26:12.745279610 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7802768Z [rank2]:[W1204 13:26:13.141619979 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7802942Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7803200Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7803364Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7803741Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7803943Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7804047Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7804143Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7804248Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7804250Z 2025-12-04T13:44:25.7804483Z [rank2]:[W1204 13:26:13.144117133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7804673Z [rank3]:[W1204 13:26:13.148801330 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7804847Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7805104Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7805268Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7805638Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7805841Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7805946Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7806042Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7806138Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7806142Z 2025-12-04T13:44:25.7806374Z [rank3]:[W1204 13:26:13.150780566 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7806545Z [rank1]:[W1204 13:26:13.745390737 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7806718Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7806972Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7807137Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7807563Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7807779Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7807885Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7807979Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7808075Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7808076Z 2025-12-04T13:44:25.7808323Z [rank1]:[W1204 13:26:13.747874332 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7808519Z [rank2]:[W1204 13:26:14.144253390 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7808693Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7808953Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7809118Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7809486Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7809693Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7809797Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7809894Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7809990Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7809992Z 2025-12-04T13:44:25.7810227Z [rank2]:[W1204 13:26:14.146515310 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7810400Z [rank3]:[W1204 13:26:14.150922982 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7810574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7810831Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7810992Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7811358Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7811570Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7811675Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7811771Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7811868Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7811870Z 2025-12-04T13:44:25.7812116Z [rank3]:[W1204 13:26:14.152196664 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7812286Z [rank1]:[W1204 13:26:14.748030988 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7812481Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7812738Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7812901Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7813269Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7813472Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7813577Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7813672Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7813769Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7813771Z 2025-12-04T13:44:25.7814007Z [rank1]:[W1204 13:26:14.750300897 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7814180Z [rank2]:[W1204 13:26:15.146642896 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7814355Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7814615Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7814780Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7815149Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7815351Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7815466Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7815562Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7815658Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7815660Z 2025-12-04T13:44:25.7815893Z [rank2]:[W1204 13:26:15.148897326 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7816083Z [rank3]:[W1204 13:26:15.152332731 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7816258Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7816536Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7816698Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7817067Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7817268Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7817373Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7817506Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7817603Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7817604Z 2025-12-04T13:44:25.7817838Z [rank3]:[W1204 13:26:15.153728660 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7818008Z [rank1]:[W1204 13:26:15.750480274 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7818181Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7818439Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7818602Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7818970Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7819171Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7819277Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7819372Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7819484Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7819486Z 2025-12-04T13:44:25.7819718Z [rank1]:[W1204 13:26:15.752450080 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7819888Z [rank2]:[W1204 13:26:16.149028994 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7820076Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7820355Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7820520Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7820890Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7821093Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7821196Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7821294Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7821391Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7821394Z 2025-12-04T13:44:25.7821625Z [rank2]:[W1204 13:26:16.151318773 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7821796Z [rank3]:[W1204 13:26:16.153900476 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7821971Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7822228Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7822395Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7822764Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7822969Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7823072Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7823169Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7823263Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7823277Z 2025-12-04T13:44:25.7823510Z [rank3]:[W1204 13:26:16.155395503 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7823678Z [rank1]:[W1204 13:26:16.752621677 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7823861Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7824117Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7824302Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7824669Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7824869Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7824975Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7825070Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7825168Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7825170Z 2025-12-04T13:44:25.7825402Z [rank1]:[W1204 13:26:16.754143673 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7825574Z [rank2]:[W1204 13:26:17.151505400 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7825748Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7826004Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7826170Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7826535Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7826737Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7826841Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7826938Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7827037Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7827040Z 2025-12-04T13:44:25.7827286Z [rank2]:[W1204 13:26:17.153718451 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7827457Z [rank3]:[W1204 13:26:17.155537641 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7827670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7827947Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7828123Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7828506Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7828708Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7828811Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7828910Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7829005Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7829009Z 2025-12-04T13:44:25.7829242Z [rank3]:[W1204 13:26:17.156754154 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7829414Z [rank1]:[W1204 13:26:17.754291611 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7829589Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7829845Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7830009Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7830378Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7830578Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7830682Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7830778Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7830875Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7830878Z 2025-12-04T13:44:25.7831111Z [rank1]:[W1204 13:26:17.755520474 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7831299Z [rank2]:[W1204 13:26:18.153848059 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7831474Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7831733Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7831908Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7832293Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7832494Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7832598Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7832695Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7832793Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7832796Z 2025-12-04T13:44:25.7833028Z [rank2]:[W1204 13:26:18.156394093 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7833201Z [rank3]:[W1204 13:26:18.156929971 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7833375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7833632Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7833794Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7834163Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7834365Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7834468Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7834563Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7834660Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7834662Z 2025-12-04T13:44:25.7834894Z [rank3]:[W1204 13:26:18.158680232 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7835078Z [rank1]:[W1204 13:26:18.755574634 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7835253Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7835507Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7835678Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7836050Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7836273Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7836378Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7836471Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7836567Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7836569Z 2025-12-04T13:44:25.7836802Z [rank1]:[W1204 13:26:18.756957924 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7836974Z [rank2]:[W1204 13:26:19.156571101 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7837149Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7837402Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7837607Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7837973Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7838179Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7838286Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7838382Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7838477Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7838480Z 2025-12-04T13:44:25.7838714Z [rank2]:[W1204 13:26:19.158744413 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7838885Z [rank3]:[W1204 13:26:19.158853310 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7839074Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7839331Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7839494Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7839875Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7840101Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7840204Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7840300Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7840397Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7840399Z 2025-12-04T13:44:25.7840636Z [rank3]:[W1204 13:26:19.160667490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7840805Z [rank1]:[W1204 13:26:19.757126072 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7840983Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7841239Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7841403Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7841771Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7841974Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7842079Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7842173Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7842269Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7842271Z 2025-12-04T13:44:25.7842506Z [rank1]:[W1204 13:26:19.759232155 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7842677Z [rank2]:[W1204 13:26:20.158914941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7842853Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7843118Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7843281Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7843654Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7847463Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7847661Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7847761Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7847858Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7847862Z 2025-12-04T13:44:25.7848096Z [rank2]:[W1204 13:26:20.161296838 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7848270Z [rank3]:[W1204 13:26:20.160853178 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7848447Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7848709Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7848872Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7849241Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7849444Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7849550Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7849646Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7849742Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7849744Z 2025-12-04T13:44:25.7849977Z [rank3]:[W1204 13:26:20.162912293 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7850146Z [rank1]:[W1204 13:26:20.759356665 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7850322Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7850596Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7850761Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7851128Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7851341Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7851472Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7851567Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7851665Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7851667Z 2025-12-04T13:44:25.7851899Z [rank1]:[W1204 13:26:20.761334241 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7852072Z [rank2]:[W1204 13:26:21.161423539 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7852249Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7852504Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7852669Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7853040Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7853244Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7853348Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7853445Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7853543Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7853545Z 2025-12-04T13:44:25.7853778Z [rank2]:[W1204 13:26:21.163753077 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7853947Z [rank3]:[W1204 13:26:21.163048172 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7854122Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7854378Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7854552Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7854922Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7855134Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7855238Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7855354Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7855449Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7855451Z 2025-12-04T13:44:25.7855683Z [rank3]:[W1204 13:26:21.164999379 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7855852Z [rank1]:[W1204 13:26:21.761475201 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7856028Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7856284Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7856448Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7856817Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7857018Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7857124Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7857220Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7857318Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7857320Z 2025-12-04T13:44:25.7857598Z [rank1]:[W1204 13:26:21.762715194 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7857769Z [rank2]:[W1204 13:26:22.164460195 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7857944Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7858200Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7858363Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7858747Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7858950Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7859065Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7859162Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7859284Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7859286Z 2025-12-04T13:44:25.7859460Z [rank3]:[W1204 13:26:22.165152059 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7859634Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7859888Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7860050Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7860416Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7860619Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7860724Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7860819Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7860915Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7860917Z 2025-12-04T13:44:25.7861150Z [rank2]:[W1204 13:26:22.166934360 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7861382Z [rank3]:[W1204 13:26:22.166942310 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7861552Z [rank1]:[W1204 13:26:22.762870124 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7861727Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7861982Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7862145Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7862524Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7862724Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7862828Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7862932Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7863028Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7863030Z 2025-12-04T13:44:25.7863287Z [rank1]:[W1204 13:26:22.765041526 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7863458Z [rank3]:[W1204 13:26:23.167094120 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7863633Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7863891Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7864054Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7864420Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7864621Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7864724Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7864819Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7864916Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7864917Z 2025-12-04T13:44:25.7865148Z [rank3]:[W1204 13:26:23.168304393 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7865321Z [rank2]:[W1204 13:26:23.167083110 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7865493Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7865748Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7865911Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7866289Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7866492Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7866595Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7866691Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7866787Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7866798Z 2025-12-04T13:44:25.7867032Z [rank2]:[W1204 13:26:23.168567008 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7867221Z [rank1]:[W1204 13:26:23.765464911 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7867397Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7867683Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7867846Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7868220Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7868423Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7868528Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7868622Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7868719Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7868720Z 2025-12-04T13:44:25.7868953Z [rank1]:[W1204 13:26:23.767687682 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7869127Z [rank3]:[W1204 13:26:24.168452795 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7869304Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7869560Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7869722Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7870088Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7870307Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7870412Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7870507Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7870603Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7870605Z 2025-12-04T13:44:25.7870850Z [rank3]:[W1204 13:26:24.169662508 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7871023Z [rank2]:[W1204 13:26:24.168676550 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7871223Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7871479Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7871643Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7872009Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7872213Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7872319Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7872415Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7872511Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7872513Z 2025-12-04T13:44:25.7872748Z [rank2]:[W1204 13:26:24.171487747 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7872917Z [rank1]:[W1204 13:26:24.767854483 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7873095Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7873351Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7873513Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7873884Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7874087Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7874200Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7874295Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7874391Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7874393Z 2025-12-04T13:44:25.7874628Z [rank1]:[W1204 13:26:24.769964106 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7874807Z [rank3]:[W1204 13:26:25.169817349 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7874999Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7875255Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7875417Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7875783Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7875984Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7876091Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7876186Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7876283Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7876285Z 2025-12-04T13:44:25.7876516Z [rank3]:[W1204 13:26:25.171185349 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7876687Z [rank2]:[W1204 13:26:25.171621519 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7876862Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7877121Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7877284Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7877688Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7877891Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7877997Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7878105Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7878201Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7878203Z 2025-12-04T13:44:25.7878435Z [rank2]:[W1204 13:26:25.174236881 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7878623Z [rank1]:[W1204 13:26:25.770108328 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7878799Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7879082Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7879245Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7879618Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7879822Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7879928Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7880022Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7880119Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7880121Z 2025-12-04T13:44:25.7880354Z [rank1]:[W1204 13:26:25.771898598 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7880523Z [rank3]:[W1204 13:26:26.171345851 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7880698Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7880953Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7881118Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7881486Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7881688Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7881793Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7881889Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7881995Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7881997Z 2025-12-04T13:44:25.7882228Z [rank3]:[W1204 13:26:26.172791529 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7882398Z [rank2]:[W1204 13:26:26.174372354 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7882580Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7882835Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7883020Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7883391Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7883596Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7883700Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7883797Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7883893Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7883895Z 2025-12-04T13:44:25.7884128Z [rank2]:[W1204 13:26:26.176090426 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7884296Z [rank1]:[W1204 13:26:26.772048771 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7884469Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7884724Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7884887Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7885252Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7885453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7885558Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7885653Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7885752Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7885753Z 2025-12-04T13:44:25.7885996Z [rank1]:[W1204 13:26:26.774013697 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7886165Z [rank3]:[W1204 13:26:27.172974510 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7886340Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7886606Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7886788Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7887153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7887355Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7887461Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7887596Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7887695Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7887698Z 2025-12-04T13:44:25.7887931Z [rank3]:[W1204 13:26:27.174663603 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7888102Z [rank2]:[W1204 13:26:27.176218519 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7888276Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7888531Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7888694Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7889065Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7889266Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7889369Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7889465Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7889561Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7889564Z 2025-12-04T13:44:25.7889809Z [rank2]:[W1204 13:26:27.178579426 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7889982Z [rank1]:[W1204 13:26:27.774155190 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7890157Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7890428Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7890591Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7890983Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7891182Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7891286Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7891380Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7891478Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7891480Z 2025-12-04T13:44:25.7891713Z [rank1]:[W1204 13:26:27.775657257 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7891883Z [rank3]:[W1204 13:26:28.174812636 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7892061Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7892318Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7892480Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7892848Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7893051Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7893155Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7893249Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7893347Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7893349Z 2025-12-04T13:44:25.7893580Z [rank3]:[W1204 13:26:28.176043879 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7893768Z [rank2]:[W1204 13:26:28.178727569 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7893942Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7894197Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7894372Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7894740Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7894961Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7895065Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7895161Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7895256Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7895259Z 2025-12-04T13:44:25.7895492Z [rank2]:[W1204 13:26:28.180959670 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7895662Z [rank1]:[W1204 13:26:28.775785701 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7895837Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7896091Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7896253Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7896622Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7896825Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7896930Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7897024Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7897121Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7897123Z 2025-12-04T13:44:25.7897355Z [rank1]:[W1204 13:26:28.778084710 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7897573Z [rank3]:[W1204 13:26:29.176202972 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7897767Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7898023Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7898186Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7898565Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7898796Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7898902Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7898996Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7899093Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7899095Z 2025-12-04T13:44:25.7899329Z [rank3]:[W1204 13:26:29.177729778 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7899498Z [rank2]:[W1204 13:26:29.181119423 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7899675Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7899930Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7900092Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7900460Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7900662Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7900766Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7900862Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7900958Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7900959Z 2025-12-04T13:44:25.7901193Z [rank2]:[W1204 13:26:29.183378974 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7901364Z [rank1]:[W1204 13:26:29.778184995 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7901538Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7901803Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7901965Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7902344Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7902563Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7902669Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7902763Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7902860Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7902861Z 2025-12-04T13:44:25.7903097Z [rank1]:[W1204 13:26:29.779334389 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7903267Z [rank3]:[W1204 13:26:30.177914032 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7903444Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7903700Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7903863Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7904228Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7904428Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7904534Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7904629Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7904724Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7904726Z 2025-12-04T13:44:25.7904957Z [rank3]:[W1204 13:26:30.179156084 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7905129Z [rank2]:[W1204 13:26:30.183556467 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7905303Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7905571Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7905735Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7907196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7907402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7907558Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7907657Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7907754Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7907756Z 2025-12-04T13:44:25.7907990Z [rank2]:[W1204 13:26:30.185591742 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7908177Z [rank1]:[W1204 13:26:30.779494913 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7908351Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7908618Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7908787Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7909153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7909355Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7909461Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7909556Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7909655Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7909657Z 2025-12-04T13:44:25.7909889Z [rank1]:[W1204 13:26:30.781183726 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7910061Z [rank3]:[W1204 13:26:31.179322468 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7910237Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7910493Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7910672Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7911039Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7911307Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7911411Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7911522Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7911620Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7911622Z 2025-12-04T13:44:25.7911855Z [rank3]:[W1204 13:26:31.180757427 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7912026Z [rank2]:[W1204 13:26:31.185753756 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7912200Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7912455Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7912619Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7912989Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7913194Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7913297Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7913394Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7913490Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7913492Z 2025-12-04T13:44:25.7913727Z [rank2]:[W1204 13:26:31.187460328 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7913896Z [rank1]:[W1204 13:26:31.781346281 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7914072Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7914327Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7914491Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7914867Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7915082Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7915198Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7915296Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7915404Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7915406Z 2025-12-04T13:44:25.7915638Z [rank1]:[W1204 13:26:31.783332957 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7915808Z [rank3]:[W1204 13:26:32.180891212 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7915987Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7916242Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7916406Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7916772Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7916976Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7917080Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7917175Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7917272Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7917275Z 2025-12-04T13:44:25.7917547Z [rank3]:[W1204 13:26:32.182342280 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7917718Z [rank2]:[W1204 13:26:32.187579574 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7917894Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7918151Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7918312Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7918697Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7918899Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7919020Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7919128Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7919224Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7919239Z 2025-12-04T13:44:25.7919472Z [rank2]:[W1204 13:26:32.190015320 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7919643Z [rank1]:[W1204 13:26:32.783505191 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7919817Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7920073Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7920234Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7920604Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7920805Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7920911Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7921006Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7921107Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7921109Z 2025-12-04T13:44:25.7921340Z [rank1]:[W1204 13:26:32.786123014 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7921511Z [rank3]:[W1204 13:26:33.182498815 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7921686Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7921948Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7922114Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7922495Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7922697Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7922811Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7922906Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7923019Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7923021Z 2025-12-04T13:44:25.7923253Z [rank3]:[W1204 13:26:33.183736738 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7923434Z [rank2]:[W1204 13:26:33.190155096 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7923608Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7923864Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7924029Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7924397Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7924601Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7924705Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7924803Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7924900Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7924902Z 2025-12-04T13:44:25.7925135Z [rank2]:[W1204 13:26:33.192027844 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7925305Z [rank1]:[W1204 13:26:33.786287409 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7925482Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7925739Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7925902Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7926275Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7926487Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7926592Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7926698Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7926795Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7926796Z 2025-12-04T13:44:25.7927039Z [rank1]:[W1204 13:26:33.788220306 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7927222Z [rank3]:[W1204 13:26:34.183876414 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7927399Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7927691Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7927855Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7928221Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7928425Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7928530Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7928625Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7928722Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7928724Z 2025-12-04T13:44:25.7928959Z [rank3]:[W1204 13:26:34.185508248 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7929130Z [rank2]:[W1204 13:26:34.192224160 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7929306Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7929563Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7929728Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7930096Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7930298Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7930414Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7930510Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7930606Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7930621Z 2025-12-04T13:44:25.7930867Z [rank2]:[W1204 13:26:34.194661686 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7931037Z [rank1]:[W1204 13:26:34.788377342 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7931225Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7931483Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7931644Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7932016Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7932219Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7932324Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7932418Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7932515Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7932519Z 2025-12-04T13:44:25.7932757Z [rank1]:[W1204 13:26:34.789600595 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7932928Z [rank3]:[W1204 13:26:35.185681214 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7933103Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7933358Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7933520Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7933889Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7934089Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7934194Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7934299Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7934396Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7934397Z 2025-12-04T13:44:25.7934628Z [rank3]:[W1204 13:26:35.186931346 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7934823Z [rank2]:[W1204 13:26:35.194771253 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7934998Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7935265Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7935427Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7935794Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7935996Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7936101Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7936197Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7936293Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7936295Z 2025-12-04T13:44:25.7936528Z [rank2]:[W1204 13:26:35.196841177 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7936699Z [rank1]:[W1204 13:26:35.789764132 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7936872Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7937131Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7937295Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7937709Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7937911Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7938017Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7938111Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7938220Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7938222Z 2025-12-04T13:44:25.7938454Z [rank1]:[W1204 13:26:35.791382186 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7938643Z [rank3]:[W1204 13:26:36.187101773 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7938830Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7939086Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7939271Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7939642Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7939845Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7939950Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7940045Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7940141Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7940144Z 2025-12-04T13:44:25.7940376Z [rank3]:[W1204 13:26:36.188407434 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7940547Z [rank2]:[W1204 13:26:36.197034993 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7940721Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7940977Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7941141Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7941514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7941717Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7941820Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7941917Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7942013Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7942015Z 2025-12-04T13:44:25.7942258Z [rank2]:[W1204 13:26:36.199108127 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7942428Z [rank1]:[W1204 13:26:36.791482584 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7942615Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7942881Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7943052Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7943420Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7943622Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7943728Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7943823Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7943921Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7943922Z 2025-12-04T13:44:25.7944155Z [rank1]:[W1204 13:26:36.793250055 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7944324Z [rank3]:[W1204 13:26:37.188580601 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7944499Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7944754Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7944918Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7945286Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7945488Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7945593Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7945689Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7945786Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7945789Z 2025-12-04T13:44:25.7946032Z [rank3]:[W1204 13:26:37.190415910 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7946203Z [rank2]:[W1204 13:26:37.199260215 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7946388Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7946653Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7946815Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7947192Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7947394Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7947534Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7947631Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7947727Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7947730Z 2025-12-04T13:44:25.7947966Z [rank2]:[W1204 13:26:37.201062895 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7948136Z [rank1]:[W1204 13:26:37.793424853 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7948310Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7948568Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7948729Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7949095Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7949296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7949401Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7949496Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7949593Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7949595Z 2025-12-04T13:44:25.7949827Z [rank1]:[W1204 13:26:37.795829719 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7950011Z [rank3]:[W1204 13:26:38.190575818 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7950189Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7950459Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7950634Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7951002Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7951215Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7951320Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7951415Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7951512Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7951514Z 2025-12-04T13:44:25.7951745Z [rank3]:[W1204 13:26:38.191800941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7951917Z [rank2]:[W1204 13:26:38.201207733 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7952091Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7952348Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7952514Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7952881Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7953085Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7953188Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7953285Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7953381Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7953383Z 2025-12-04T13:44:25.7953616Z [rank2]:[W1204 13:26:38.203191889 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7953786Z [rank1]:[W1204 13:26:38.795983057 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7953980Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7954236Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7954410Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7954788Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7954999Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7955102Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7955197Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7955294Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7955296Z 2025-12-04T13:44:25.7955528Z [rank1]:[W1204 13:26:38.798422024 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7955697Z [rank3]:[W1204 13:26:39.191944630 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7955871Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7956126Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7956289Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7956659Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7956862Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7956967Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7957062Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7957160Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7957162Z 2025-12-04T13:44:25.7957394Z [rank3]:[W1204 13:26:39.194296738 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7957581Z [rank2]:[W1204 13:26:39.203340807 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7957755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7958031Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7958207Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7958591Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7958809Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7958913Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7959012Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7959108Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7959110Z 2025-12-04T13:44:25.7959342Z [rank2]:[W1204 13:26:39.205324544 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7959512Z [rank1]:[W1204 13:26:39.798580422 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7959686Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7959944Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7960105Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7960473Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7960676Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7960781Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7960876Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7960973Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7960975Z 2025-12-04T13:44:25.7961211Z [rank1]:[W1204 13:26:39.800828042 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7961381Z [rank3]:[W1204 13:26:40.194391688 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7961558Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7961822Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7961986Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7962373Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7962574Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7962688Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7962783Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7962879Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7962881Z 2025-12-04T13:44:25.7963112Z [rank3]:[W1204 13:26:40.196943151 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7963287Z [rank2]:[W1204 13:26:40.205489712 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7963464Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7963723Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7963887Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7964254Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7964458Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7964562Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7964659Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7964755Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7964757Z 2025-12-04T13:44:25.7964990Z [rank2]:[W1204 13:26:40.207344521 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7965162Z [rank1]:[W1204 13:26:40.800967142 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7965337Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7965595Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7965771Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7966137Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7966360Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7966480Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7966576Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7966673Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7966675Z 2025-12-04T13:44:25.7966908Z [rank1]:[W1204 13:26:40.803386208 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7967079Z [rank3]:[W1204 13:26:41.197102690 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7967255Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7967624Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7967794Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7968160Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7968363Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7968468Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7968563Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7968660Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7968663Z 2025-12-04T13:44:25.7968894Z [rank3]:[W1204 13:26:41.198645936 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7969065Z [rank2]:[W1204 13:26:41.207501231 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7969239Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7969495Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7969678Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7970055Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7970269Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7970384Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7970480Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7970589Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7970591Z 2025-12-04T13:44:25.7970823Z [rank2]:[W1204 13:26:41.208809991 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7970993Z [rank1]:[W1204 13:26:41.803570317 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7971166Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7971421Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7971584Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7971958Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7972162Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7972269Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7972364Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7972461Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7972463Z 2025-12-04T13:44:25.7972696Z [rank1]:[W1204 13:26:41.806220099 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7972866Z [rank3]:[W1204 13:26:42.198758057 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7973042Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7973299Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7973461Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7973837Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7974053Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7974159Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7974263Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7974361Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7974374Z 2025-12-04T13:44:25.7974607Z [rank3]:[W1204 13:26:42.201208793 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7974779Z [rank2]:[W1204 13:26:42.208963741 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7974954Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7975211Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7975374Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7975742Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7975946Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7976050Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7976147Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7976243Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7976246Z 2025-12-04T13:44:25.7976485Z [rank2]:[W1204 13:26:42.211132303 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7976656Z [rank1]:[W1204 13:26:42.806366999 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7976830Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7977088Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7977250Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7977675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7977876Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7977995Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7978090Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7978198Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7978200Z 2025-12-04T13:44:25.7978434Z [rank1]:[W1204 13:26:42.809045780 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7978618Z [rank3]:[W1204 13:26:43.201318644 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7978793Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7979052Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7979217Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7979587Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7979788Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7979893Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7979990Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7980088Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7980090Z 2025-12-04T13:44:25.7980322Z [rank3]:[W1204 13:26:43.203794589 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7980496Z [rank2]:[W1204 13:26:43.211275354 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7980671Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7980929Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7981095Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7981462Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7981675Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7981778Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7981884Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7981980Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7981982Z 2025-12-04T13:44:25.7982229Z [rank2]:[W1204 13:26:43.213538934 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7982408Z [rank1]:[W1204 13:26:43.809213170 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7982583Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7982838Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7983002Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7983370Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7983572Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7983677Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7983774Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7983871Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7983873Z 2025-12-04T13:44:25.7984109Z [rank1]:[W1204 13:26:43.811419891 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7984278Z [rank3]:[W1204 13:26:44.203919911 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7984453Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7984707Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7984871Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7985244Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7985456Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7985562Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7985656Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7985762Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7985764Z 2025-12-04T13:44:25.7986005Z [rank3]:[W1204 13:26:44.206082313 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7986177Z [rank2]:[W1204 13:26:44.213689575 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7986361Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7986617Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7986783Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7987153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7987359Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7987465Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7987598Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7987694Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7987698Z 2025-12-04T13:44:25.7987930Z [rank2]:[W1204 13:26:44.216022663 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7988101Z [rank1]:[W1204 13:26:44.811523333 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7988276Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.7988533Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7988696Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7989065Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7989267Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7989386Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7989483Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7989580Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7989594Z 2025-12-04T13:44:25.7989827Z [rank1]:[W1204 13:26:44.812761526 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7990009Z [rank3]:[W1204 13:26:45.206213095 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7990200Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7990454Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7990616Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7990984Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7991183Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7991289Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7991384Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7991481Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7991483Z 2025-12-04T13:44:25.7991718Z [rank3]:[W1204 13:26:45.208508824 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7991891Z [rank2]:[W1204 13:26:45.216148365 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7992067Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7992323Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7992487Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7992853Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7993057Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7993162Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7993257Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7993362Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7993366Z 2025-12-04T13:44:25.7993597Z [rank2]:[W1204 13:26:45.217487615 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7993778Z [rank1]:[W1204 13:26:45.812907498 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7993963Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7994229Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7994391Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7994759Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7994964Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7995067Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7995163Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7995259Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7995260Z 2025-12-04T13:44:25.7995493Z [rank1]:[W1204 13:26:45.814160160 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7995663Z [rank3]:[W1204 13:26:46.208716095 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7995839Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7996099Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7996265Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7996631Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7996834Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7996939Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7997034Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7997130Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7997141Z 2025-12-04T13:44:25.7997373Z [rank3]:[W1204 13:26:46.211109922 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7997591Z [rank2]:[W1204 13:26:46.217656117 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7997780Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7998034Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.7998211Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7998586Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.7998789Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.7998893Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.7998989Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7999088Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.7999091Z 2025-12-04T13:44:25.7999323Z [rank2]:[W1204 13:26:46.219649313 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.7999493Z [rank1]:[W1204 13:26:46.814319422 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.7999666Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.7999921Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8000084Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8000456Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8000659Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8000763Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8000859Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8000955Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8000957Z 2025-12-04T13:44:25.8001201Z [rank1]:[W1204 13:26:46.816058863 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8001372Z [rank3]:[W1204 13:26:47.211278604 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8001559Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8001826Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8001999Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8002365Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8002567Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8002673Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8002768Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8002865Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8002868Z 2025-12-04T13:44:25.8003100Z [rank3]:[W1204 13:26:47.213617702 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8003270Z [rank2]:[W1204 13:26:47.219810575 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8003446Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8003702Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8003867Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8004234Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8004436Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8004540Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8004637Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8004735Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8004737Z 2025-12-04T13:44:25.8004974Z [rank2]:[W1204 13:26:47.221897529 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8005153Z [rank1]:[W1204 13:26:47.816198606 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8005326Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8005590Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8005761Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8006141Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8006342Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8006446Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8006542Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8006638Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8006640Z 2025-12-04T13:44:25.8006873Z [rank1]:[W1204 13:26:47.817426029 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8007047Z [rank3]:[W1204 13:26:48.213725916 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8007224Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8007521Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8007686Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8008054Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8008256Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8008359Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8008454Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8008552Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8008553Z 2025-12-04T13:44:25.8008785Z [rank3]:[W1204 13:26:48.215978776 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8008969Z [rank2]:[W1204 13:26:48.222265997 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8009145Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8009405Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8009600Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8009965Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8010180Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8010283Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8010379Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8010476Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8010478Z 2025-12-04T13:44:25.8010711Z [rank2]:[W1204 13:26:48.224152715 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8010882Z [rank1]:[W1204 13:26:48.817573222 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8011056Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8011311Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8011476Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8011844Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8012047Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8012150Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8012245Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8012341Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8012343Z 2025-12-04T13:44:25.8012576Z [rank1]:[W1204 13:26:48.818828665 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8012747Z [rank3]:[W1204 13:26:49.216159548 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8012937Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8013193Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8013366Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8013746Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8013955Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8014061Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8014156Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8014254Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8014256Z 2025-12-04T13:44:25.8014490Z [rank3]:[W1204 13:26:49.218369969 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8014661Z [rank2]:[W1204 13:26:49.224311348 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8014837Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8015094Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8015258Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8015626Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8015832Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8015937Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8016033Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8016131Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8016134Z 2025-12-04T13:44:25.8016366Z [rank2]:[W1204 13:26:49.226619937 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8016538Z [rank1]:[W1204 13:26:49.818967088 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8016713Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8016981Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8017143Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8017575Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8017777Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8017896Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8017993Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8018091Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8018094Z 2025-12-04T13:44:25.8018327Z [rank1]:[W1204 13:26:49.820233410 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8018496Z [rank3]:[W1204 13:26:50.218496724 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8018670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8018929Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8019090Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8019459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8019660Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8019765Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8019861Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8019958Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8019960Z 2025-12-04T13:44:25.8020194Z [rank3]:[W1204 13:26:50.221281432 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8020365Z [rank2]:[W1204 13:26:50.226731702 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8020540Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8020807Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8020970Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8021337Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8021562Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8021679Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8021775Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8021872Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8021874Z 2025-12-04T13:44:25.8022108Z [rank2]:[W1204 13:26:50.228785786 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8022281Z [rank1]:[W1204 13:26:50.820374615 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8022459Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8022717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8022880Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8023246Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8023449Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8023553Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8023650Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8023745Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8023747Z 2025-12-04T13:44:25.8023979Z [rank1]:[W1204 13:26:50.821702655 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8024149Z [rank3]:[W1204 13:26:51.221446966 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8024326Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8024586Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8024760Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8025126Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8025349Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8025454Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8025566Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8025662Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8025664Z 2025-12-04T13:44:25.8025897Z [rank3]:[W1204 13:26:51.223525690 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8026067Z [rank2]:[W1204 13:26:51.228913541 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8026244Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8026501Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8026664Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8027037Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8027240Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8027345Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8027440Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8027579Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8027580Z 2025-12-04T13:44:25.8027815Z [rank2]:[W1204 13:26:51.231206280 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8027987Z [rank1]:[W1204 13:26:51.821998256 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8028161Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8028416Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8028577Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8028963Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8029177Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8029293Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8029390Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8029498Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8029500Z 2025-12-04T13:44:25.8029737Z [rank1]:[W1204 13:26:51.823455774 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8029906Z [rank3]:[W1204 13:26:52.223691164 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8030084Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8030343Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8030507Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8030876Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8031080Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8031186Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8031282Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8031378Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8031381Z 2025-12-04T13:44:25.8031614Z [rank3]:[W1204 13:26:52.226023733 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8031785Z [rank2]:[W1204 13:26:52.231354495 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8031959Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8032214Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8032377Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8032753Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8032955Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8033069Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8033173Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8033271Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8033282Z 2025-12-04T13:44:25.8033518Z [rank2]:[W1204 13:26:52.233438339 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8033688Z [rank1]:[W1204 13:26:52.823581090 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8033860Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8034117Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8034281Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8034650Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8034852Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8034956Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8035051Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8035147Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8035149Z 2025-12-04T13:44:25.8035381Z [rank1]:[W1204 13:26:52.824837762 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8035554Z [rank3]:[W1204 13:26:53.226181478 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8035730Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8035987Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8036151Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8036530Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8036731Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8036835Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8036946Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8037044Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8037056Z 2025-12-04T13:44:25.8037288Z [rank3]:[W1204 13:26:53.228545606 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8037467Z [rank2]:[W1204 13:26:53.233549795 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8037702Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8037960Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8038125Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8038492Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8038695Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8038801Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8038898Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8038994Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8038996Z 2025-12-04T13:44:25.8039228Z [rank2]:[W1204 13:26:53.235948772 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8039400Z [rank1]:[W1204 13:26:53.825023187 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8039574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8039831Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8039995Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8040364Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8040582Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8040686Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8040781Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8040891Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8040893Z 2025-12-04T13:44:25.8041143Z [rank1]:[W1204 13:26:53.826431936 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8041312Z [rank3]:[W1204 13:26:54.228702421 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8041500Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8041758Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8041922Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8042294Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8042495Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8042601Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8042695Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8042793Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8042795Z 2025-12-04T13:44:25.8043030Z [rank3]:[W1204 13:26:54.230907662 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8043199Z [rank2]:[W1204 13:26:54.236050509 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8043374Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8043629Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8043792Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8044160Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8044366Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8044480Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8044575Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8044673Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8044685Z 2025-12-04T13:44:25.8044917Z [rank2]:[W1204 13:26:54.237953576 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8045096Z [rank1]:[W1204 13:26:54.826565112 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8045278Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8045536Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8045699Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8046067Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8046269Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8046374Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8046471Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8046566Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8046568Z 2025-12-04T13:44:25.8046803Z [rank1]:[W1204 13:26:54.827810374 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8046973Z [rank3]:[W1204 13:26:55.231042869 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8047147Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8047407Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8047613Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8047981Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8048181Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8048286Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8048395Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8048492Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8048494Z 2025-12-04T13:44:25.8048728Z [rank3]:[W1204 13:26:55.233101233 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8048923Z [rank2]:[W1204 13:26:55.238094193 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8049098Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8049365Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8049530Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8049897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8050101Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8050206Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8050301Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8050399Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8050401Z 2025-12-04T13:44:25.8050632Z [rank2]:[W1204 13:26:55.240136148 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8050803Z [rank1]:[W1204 13:26:55.827944431 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8050978Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8051234Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8051398Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8051762Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8051966Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8052070Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8052166Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8052271Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8052273Z 2025-12-04T13:44:25.8052506Z [rank1]:[W1204 13:26:55.829524736 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8052687Z [rank3]:[W1204 13:26:56.233280179 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8052870Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8053129Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8053302Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8053668Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8053870Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8053975Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8054070Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8054166Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8054168Z 2025-12-04T13:44:25.8054401Z [rank3]:[W1204 13:26:56.235608158 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8054570Z [rank2]:[W1204 13:26:56.240245465 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8054745Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8054999Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8055164Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8055532Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8055736Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8055841Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8055936Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8056034Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8056035Z 2025-12-04T13:44:25.8056280Z [rank2]:[W1204 13:26:56.242191112 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8056452Z [rank1]:[W1204 13:26:56.829666223 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8056641Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8056906Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8057079Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8057449Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8057694Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8057799Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8057895Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8057991Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8057992Z 2025-12-04T13:44:25.8058229Z [rank1]:[W1204 13:26:56.831033173 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8058399Z [rank3]:[W1204 13:26:57.235768615 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8058573Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8058829Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8058991Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8059361Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8059566Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8059673Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8059770Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8059865Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8059868Z 2025-12-04T13:44:25.8060114Z [rank3]:[W1204 13:26:57.237792690 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8060283Z [rank2]:[W1204 13:26:57.242326740 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8060458Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8060738Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8060901Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8061281Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8061483Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8061589Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8061684Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8061782Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8061784Z 2025-12-04T13:44:25.8062019Z [rank2]:[W1204 13:26:57.244894873 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8062190Z [rank1]:[W1204 13:26:57.831189800 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8062364Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8062622Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8062785Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8063151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8063355Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8063460Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8063556Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8063652Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8063654Z 2025-12-04T13:44:25.8063889Z [rank1]:[W1204 13:26:57.833313823 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8064071Z [rank3]:[W1204 13:26:58.237919668 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8064245Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8064501Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8064690Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8065056Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8065267Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8065371Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8065468Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8065563Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8065565Z 2025-12-04T13:44:25.8065797Z [rank3]:[W1204 13:26:58.239836645 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8065966Z [rank2]:[W1204 13:26:58.245047790 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8066143Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8066399Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8066564Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8066931Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8067133Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8067238Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8067334Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8067432Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8067434Z 2025-12-04T13:44:25.8067704Z [rank2]:[W1204 13:26:58.247206112 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8067876Z [rank1]:[W1204 13:26:58.833465121 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8068063Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8068319Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8068497Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8068878Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8069092Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8069195Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8069292Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8069389Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8069391Z 2025-12-04T13:44:25.8069623Z [rank1]:[W1204 13:26:58.834669134 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8069793Z [rank3]:[W1204 13:26:59.240727147 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8069968Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8070224Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8070386Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8070757Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8070959Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8071063Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8071159Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8071254Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8071257Z 2025-12-04T13:44:25.8071490Z [rank3]:[W1204 13:26:59.242836000 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8071658Z [rank2]:[W1204 13:26:59.247372490 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8071832Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8072106Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8072271Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8072661Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8072878Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8072985Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8073079Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8073176Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8073179Z 2025-12-04T13:44:25.8073410Z [rank2]:[W1204 13:26:59.249627310 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8073580Z [rank1]:[W1204 13:26:59.834767854 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8073755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8074010Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8074172Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8074539Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8074740Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8074844Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8074940Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8075038Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8075040Z 2025-12-04T13:44:25.8075275Z [rank1]:[W1204 13:26:59.836015756 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8075447Z [rank3]:[W1204 13:27:00.243027838 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8075622Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8075892Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8076054Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8076439Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8076640Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8076754Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8076849Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8076945Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8076946Z 2025-12-04T13:44:25.8077182Z [rank3]:[W1204 13:27:00.244852958 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8077353Z [rank2]:[W1204 13:27:00.249797928 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8077568Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8077832Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8077994Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8078361Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8078563Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8078669Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8078764Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8078861Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8078863Z 2025-12-04T13:44:25.8079095Z [rank2]:[W1204 13:27:00.251764425 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8079269Z [rank1]:[W1204 13:27:00.836409750 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8079445Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8079701Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8079879Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8080246Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8080474Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8080578Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8080685Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8080782Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8080783Z 2025-12-04T13:44:25.8081016Z [rank1]:[W1204 13:27:00.837941386 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8081188Z [rank3]:[W1204 13:27:01.245019616 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8081362Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8081623Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8081787Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8082153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8082356Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8082459Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8082559Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8082654Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8082656Z 2025-12-04T13:44:25.8082890Z [rank3]:[W1204 13:27:01.246581632 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8083058Z [rank2]:[W1204 13:27:01.251919954 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8083234Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8083488Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8083653Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8084032Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8084243Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8084359Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8084454Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8084570Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8084571Z 2025-12-04T13:44:25.8084804Z [rank2]:[W1204 13:27:01.253982198 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8084975Z [rank1]:[W1204 13:27:01.838096245 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8085151Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8085406Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8085569Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8085941Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8086144Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8086249Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8086346Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8086441Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8086446Z 2025-12-04T13:44:25.8086678Z [rank1]:[W1204 13:27:01.839539103 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8086850Z [rank3]:[W1204 13:27:02.246745141 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8087026Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8087282Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8087444Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8087864Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8088067Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8088184Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8088294Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8088389Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8088404Z 2025-12-04T13:44:25.8088637Z [rank3]:[W1204 13:27:02.249165377 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8088807Z [rank2]:[W1204 13:27:02.254157187 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8088983Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8089240Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8089405Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8089776Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8089977Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8090083Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8090178Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8090277Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8090278Z 2025-12-04T13:44:25.8090513Z [rank2]:[W1204 13:27:02.256838158 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8090685Z [rank1]:[W1204 13:27:02.839674563 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8090860Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8091115Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8091279Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8091658Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8091860Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8091974Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8092069Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8092176Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8092179Z 2025-12-04T13:44:25.8092413Z [rank1]:[W1204 13:27:02.841066002 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8092597Z [rank3]:[W1204 13:27:03.249317907 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8092771Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8093030Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8093194Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8093564Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8093766Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8093870Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8093967Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8094062Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8094065Z 2025-12-04T13:44:25.8094298Z [rank3]:[W1204 13:27:03.251594027 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8094468Z [rank2]:[W1204 13:27:03.256953018 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8094644Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8094899Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8095062Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8095433Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8099594Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8099702Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8099816Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8099914Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8099916Z 2025-12-04T13:44:25.8100163Z [rank2]:[W1204 13:27:03.259038612 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8100349Z [rank1]:[W1204 13:27:03.841222322 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8100524Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8100782Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8100947Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8101315Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8101518Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8101622Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8101718Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8101815Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8101817Z 2025-12-04T13:44:25.8102052Z [rank1]:[W1204 13:27:03.843050572 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8102223Z [rank3]:[W1204 13:27:04.251767237 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8102399Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8102655Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8102819Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8103190Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8103392Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8103506Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8103603Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8103699Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8103716Z 2025-12-04T13:44:25.8103967Z [rank3]:[W1204 13:27:04.253022989 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8104137Z [rank2]:[W1204 13:27:04.259196852 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8104324Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8104578Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8104740Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8105112Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8105315Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8105421Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8105517Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8105614Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8105617Z 2025-12-04T13:44:25.8105848Z [rank2]:[W1204 13:27:04.261490722 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8106020Z [rank1]:[W1204 13:27:04.843205032 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8106194Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8106450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8106612Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8106979Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8107182Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8107287Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8107395Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8107532Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8107534Z 2025-12-04T13:44:25.8107765Z [rank1]:[W1204 13:27:04.845218628 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8107962Z [rank3]:[W1204 13:27:05.253218029 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8108137Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8108407Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8108570Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8108942Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8109144Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8109249Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8109346Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8109442Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8109444Z 2025-12-04T13:44:25.8109680Z [rank3]:[W1204 13:27:05.255573327 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8109852Z [rank2]:[W1204 13:27:05.261655992 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8110027Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8110282Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8110445Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8110811Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8111013Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8111118Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8111213Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8111321Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8111323Z 2025-12-04T13:44:25.8111555Z [rank2]:[W1204 13:27:05.263622729 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8111736Z [rank1]:[W1204 13:27:05.845373079 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8111921Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8112179Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8112352Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8112717Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8112920Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8113024Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8113120Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8113216Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8113218Z 2025-12-04T13:44:25.8113451Z [rank1]:[W1204 13:27:05.847400614 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8113621Z [rank3]:[W1204 13:27:06.255757057 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8113796Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8114053Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8114217Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8114583Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8114785Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8114889Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8114987Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8115082Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8115084Z 2025-12-04T13:44:25.8115333Z [rank3]:[W1204 13:27:06.258195103 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8115503Z [rank2]:[W1204 13:27:06.263769520 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8115689Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8115952Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8116128Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8116496Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8116698Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8116804Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8116899Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8116997Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8116999Z 2025-12-04T13:44:25.8117232Z [rank2]:[W1204 13:27:06.265693548 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8117404Z [rank1]:[W1204 13:27:06.847555216 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8117614Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8117871Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8118034Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8118403Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8118606Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8118709Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8118807Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8118902Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8118906Z 2025-12-04T13:44:25.8119153Z [rank1]:[W1204 13:27:06.849115751 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8119324Z [rank3]:[W1204 13:27:07.258328906 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8119516Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8119783Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8119944Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8120325Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8120527Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8120632Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8120729Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8120824Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8120827Z 2025-12-04T13:44:25.8121060Z [rank3]:[W1204 13:27:07.260736422 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8121231Z [rank2]:[W1204 13:27:07.265860439 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8121405Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8121662Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8121825Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8122193Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8122395Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8122500Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8122594Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8122693Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8122695Z 2025-12-04T13:44:25.8122927Z [rank2]:[W1204 13:27:07.267959663 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8123107Z [rank1]:[W1204 13:27:07.849275833 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8123283Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8123547Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8123719Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8124086Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8124297Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8124402Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8124498Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8124596Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8124599Z 2025-12-04T13:44:25.8124834Z [rank1]:[W1204 13:27:07.850773640 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8125007Z [rank3]:[W1204 13:27:08.260949973 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8125181Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8125436Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8125599Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8125967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8126169Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8126272Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8126370Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8126465Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8126467Z 2025-12-04T13:44:25.8126700Z [rank3]:[W1204 13:27:08.263432208 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8126870Z [rank2]:[W1204 13:27:08.268112465 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8127057Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8127313Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8127518Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8127905Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8128121Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8128226Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8128320Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8128418Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8128420Z 2025-12-04T13:44:25.8128651Z [rank2]:[W1204 13:27:08.270086061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8128823Z [rank1]:[W1204 13:27:08.850932312 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8128999Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8129259Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8129425Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8129793Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8129996Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8130100Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8130196Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8130292Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8130295Z 2025-12-04T13:44:25.8130528Z [rank1]:[W1204 13:27:08.852559056 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8130698Z [rank3]:[W1204 13:27:09.263652059 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8130872Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8131139Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8131314Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8131696Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8131910Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8132014Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8132109Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8132205Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8132208Z 2025-12-04T13:44:25.8132442Z [rank3]:[W1204 13:27:09.266015407 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8132610Z [rank2]:[W1204 13:27:09.270255063 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8132786Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8133041Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8133203Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8133574Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8133778Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8133883Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8133978Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8134074Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8134077Z 2025-12-04T13:44:25.8134308Z [rank2]:[W1204 13:27:09.272360977 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8134479Z [rank1]:[W1204 13:27:09.852740668 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8134654Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8134924Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8135088Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8135474Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8135676Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8135790Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8135887Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8135983Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8135986Z 2025-12-04T13:44:25.8136217Z [rank1]:[W1204 13:27:09.854740524 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8136390Z [rank3]:[W1204 13:27:10.266215389 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8136564Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8136823Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8136986Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8137353Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8137593Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8137698Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8137795Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8137890Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8137892Z 2025-12-04T13:44:25.8138129Z [rank3]:[W1204 13:27:10.268471319 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8138299Z [rank2]:[W1204 13:27:10.272521040 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8138475Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8138730Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8138907Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8139275Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8139499Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8139620Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8139715Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8139812Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8139814Z 2025-12-04T13:44:25.8140046Z [rank2]:[W1204 13:27:10.274623913 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8140220Z [rank1]:[W1204 13:27:10.854906427 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8140395Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8140650Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8140815Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8141180Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8141383Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8141487Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8141584Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8141679Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8141682Z 2025-12-04T13:44:25.8141914Z [rank1]:[W1204 13:27:10.857044480 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8142085Z [rank3]:[W1204 13:27:11.268662802 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8142260Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8142519Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8142691Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8143057Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8143269Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8143382Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8143478Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8143585Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8143586Z 2025-12-04T13:44:25.8143820Z [rank3]:[W1204 13:27:11.270706927 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8143989Z [rank2]:[W1204 13:27:11.274866525 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8144165Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8144422Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8144587Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8144955Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8145156Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8145262Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8145358Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8145456Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8145458Z 2025-12-04T13:44:25.8145690Z [rank2]:[W1204 13:27:11.276636796 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8145860Z [rank1]:[W1204 13:27:11.857186564 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8146035Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8146289Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8146453Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8146834Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8147045Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8147149Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8147254Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8147350Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8147367Z 2025-12-04T13:44:25.8147641Z [rank1]:[W1204 13:27:11.858440826 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8147810Z [rank3]:[W1204 13:27:12.270905760 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8147984Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8148241Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8148403Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8148771Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8148977Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8149082Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8149178Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8149274Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8149277Z 2025-12-04T13:44:25.8149510Z [rank3]:[W1204 13:27:12.273094771 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8149679Z [rank2]:[W1204 13:27:12.276787060 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8149853Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8150110Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8150271Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8150651Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8150852Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8150969Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8151064Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8151175Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8151177Z 2025-12-04T13:44:25.8151409Z [rank2]:[W1204 13:27:12.279234365 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8151595Z [rank1]:[W1204 13:27:12.858590300 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8151769Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8152025Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8152189Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8152554Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8152756Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8152861Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8152957Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8153054Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8153056Z 2025-12-04T13:44:25.8153291Z [rank1]:[W1204 13:27:12.860609696 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8153464Z [rank3]:[W1204 13:27:13.273309264 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8153638Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8153893Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8154057Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8154423Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8154634Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8154738Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8154845Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8154940Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8154942Z 2025-12-04T13:44:25.8155183Z [rank3]:[W1204 13:27:13.274956458 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8155366Z [rank2]:[W1204 13:27:13.279417769 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8155545Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8155806Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8155969Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8156342Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8156544Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8156649Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8156744Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8156842Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8156843Z 2025-12-04T13:44:25.8157075Z [rank2]:[W1204 13:27:13.281548242 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8157246Z [rank1]:[W1204 13:27:13.860723421 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8157422Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8157716Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8157880Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8158245Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8158459Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8158564Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8158659Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8158770Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8158772Z 2025-12-04T13:44:25.8159016Z [rank1]:[W1204 13:27:13.862173609 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8159186Z [rank3]:[W1204 13:27:14.275141442 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8159372Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8159628Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8159792Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8160165Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8160367Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8160471Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8160566Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8160661Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8160664Z 2025-12-04T13:44:25.8160897Z [rank3]:[W1204 13:27:14.276982941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8161068Z [rank2]:[W1204 13:27:14.281718417 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8161243Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8161499Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8161661Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8162037Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8162239Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8162354Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8162450Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8162546Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8162558Z 2025-12-04T13:44:25.8162791Z [rank2]:[W1204 13:27:14.283847530 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8162975Z [rank1]:[W1204 13:27:14.862345484 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8163160Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8163416Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8163579Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8163947Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8164151Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8164258Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8164354Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8164451Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8164453Z 2025-12-04T13:44:25.8164686Z [rank1]:[W1204 13:27:14.863606126 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8164859Z [rank3]:[W1204 13:27:15.277101817 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8165034Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8165294Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8165456Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8165825Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8166029Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8166133Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8166229Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8166334Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8166336Z 2025-12-04T13:44:25.8166571Z [rank3]:[W1204 13:27:15.278266832 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8166757Z [rank2]:[W1204 13:27:15.284027284 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8166942Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8167207Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8167370Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8167765Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8167967Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8168073Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8168168Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8168266Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8168268Z 2025-12-04T13:44:25.8168504Z [rank2]:[W1204 13:27:15.286188866 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8168676Z [rank1]:[W1204 13:27:15.863789831 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8168853Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8169108Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8169279Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8169644Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8169847Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8169951Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8170046Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8170142Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8170157Z 2025-12-04T13:44:25.8170389Z [rank1]:[W1204 13:27:15.865682899 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8170572Z [rank3]:[W1204 13:27:16.278440507 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8170760Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8171020Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8171194Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8171561Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8171762Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8171866Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8171962Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8172059Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8172060Z 2025-12-04T13:44:25.8172294Z [rank3]:[W1204 13:27:16.279689139 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8172463Z [rank2]:[W1204 13:27:16.286360562 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8172638Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8172897Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8173061Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8173429Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8173631Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8173736Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8173831Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8173929Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8173931Z 2025-12-04T13:44:25.8174174Z [rank2]:[W1204 13:27:16.288704160 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8174218Z PASSED [146.9985s] [ 2%] 2025-12-04T13:44:25.8174517Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline I1204 13:27:16.533000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 70936 2025-12-04T13:44:25.8174679Z I1204 13:27:16.534000 
67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 70937 2025-12-04T13:44:25.8174838Z I1204 13:27:16.534000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 70938 2025-12-04T13:44:25.8174994Z I1204 13:27:16.534000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 70939 2025-12-04T13:44:25.8175167Z [rank1]:[W1204 13:27:16.865846875 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8175342Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8175600Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8175765Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8176132Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8176335Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8176440Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8176537Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8176633Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8176634Z 2025-12-04T13:44:25.8176869Z [rank1]:[W1204 13:27:16.867099417 
ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8177040Z [rank3]:[W1204 13:27:17.279856925 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8177215Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8177506Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8177670Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8178038Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8178254Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8178359Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8178455Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8178570Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8178572Z 2025-12-04T13:44:25.8178820Z [rank3]:[W1204 13:27:17.281086388 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8179001Z [rank2]:[W1204 13:27:17.288903345 
TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8179177Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8179431Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8179597Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8179967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8180169Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8180274Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8180368Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8180466Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8180468Z 2025-12-04T13:44:25.8180701Z [rank2]:[W1204 13:27:17.291135106 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8180871Z [rank1]:[W1204 13:27:17.867261114 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8181044Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8181299Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8181461Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8181831Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8182037Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8182150Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8182246Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8182341Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8182352Z 2025-12-04T13:44:25.8182595Z [rank1]:[W1204 13:27:17.868899787 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8182765Z [rank3]:[W1204 13:27:18.281240515 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8182949Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8183205Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8183368Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8183734Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8183937Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8184043Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8184138Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8184233Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8184236Z 2025-12-04T13:44:25.8184469Z [rank3]:[W1204 13:27:18.282474027 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8184637Z [rank2]:[W1204 13:27:18.291281413 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8184812Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8185065Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8185229Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8185595Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8185795Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8185900Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8186004Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8186102Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8186104Z 2025-12-04T13:44:25.8186336Z [rank2]:[W1204 13:27:18.292525655 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8186528Z [rank1]:[W1204 13:27:18.869020715 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8186702Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8186969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8187131Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8187535Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8187737Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8187841Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8187938Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8188033Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8188035Z 2025-12-04T13:44:25.8188270Z [rank1]:[W1204 13:27:18.870190579 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8188442Z [rank3]:[W1204 13:27:19.282635074 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8188616Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8188872Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8189034Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8189400Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8189601Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8189705Z frame #4: + 0xdc253 (0x719d06c48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8189800Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8189911Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8189913Z 2025-12-04T13:44:25.8190146Z [rank3]:[W1204 13:27:19.283862647 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8190327Z [rank2]:[W1204 13:27:19.292693932 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8190516Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8190772Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8190951Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8191317Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8191520Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8191625Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8191721Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8191818Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.8191821Z 2025-12-04T13:44:25.8192055Z [rank2]:[W1204 13:27:19.294155420 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8192228Z [rank1]:[W1204 13:27:19.870344077 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8192404Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8192663Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8192829Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8193197Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8193402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8193505Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8193600Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8193697Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8193699Z 2025-12-04T13:44:25.8193953Z [rank1]:[W1204 13:27:19.871780365 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:25.8194124Z [rank3]:[W1204 13:27:20.284026425 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8194307Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8194573Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8194746Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8195120Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8195321Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8195426Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8195522Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8195619Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8195620Z 2025-12-04T13:44:25.8195855Z [rank3]:[W1204 13:27:20.285257007 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8196023Z [rank2]:[W1204 13:27:20.294305657 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8196200Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8196454Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8196621Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8196990Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8197194Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8197299Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8197395Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8197526Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8197530Z 2025-12-04T13:44:25.8197778Z [rank2]:[W1204 13:27:20.295699467 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8197949Z [rank1]:[W1204 13:27:20.871934513 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8198134Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8198404Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8198567Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8198946Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8199147Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8199252Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8199349Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8199445Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8199448Z 2025-12-04T13:44:25.8199680Z [rank1]:[W1204 13:27:20.873399660 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8199851Z [rank3]:[W1204 13:27:21.285399796 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8200024Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8200282Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8200444Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.8200810Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8201012Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8201117Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8201213Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8201309Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8201311Z 2025-12-04T13:44:25.8201547Z [rank3]:[W1204 13:27:21.287230425 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8201726Z [rank2]:[W1204 13:27:21.295855365 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8201901Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8202169Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8202341Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8202708Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.8202920Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8203027Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8203121Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8203220Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8203222Z 2025-12-04T13:44:25.8203455Z [rank2]:[W1204 13:27:21.297225884 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8203627Z [rank1]:[W1204 13:27:21.875114654 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8203803Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8204057Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8204222Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8204586Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8204788Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8204892Z frame #4: + 0xdc253 
(0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8204989Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8205084Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8205087Z 2025-12-04T13:44:25.8205321Z [rank1]:[W1204 13:27:21.876541472 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8205491Z [rank3]:[W1204 13:27:22.287404183 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8205675Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8205934Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8206108Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8206483Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8206702Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8206806Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8206902Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8206999Z frame #6: + 0x1268c0 (0x719e15f168c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8207001Z 2025-12-04T13:44:25.8207235Z [rank3]:[W1204 13:27:22.288651866 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8207405Z [rank2]:[W1204 13:27:22.297379193 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8207623Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8207876Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8208043Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8208415Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8208617Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8208721Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8208815Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8208913Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8208914Z 2025-12-04T13:44:25.8209148Z [rank2]:[W1204 13:27:22.298702014 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8209318Z [rank1]:[W1204 13:27:22.878219337 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8209493Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8209764Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8209939Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8210319Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8210532Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8210637Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8210733Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8210829Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8210833Z 2025-12-04T13:44:25.8211067Z [rank1]:[W1204 13:27:22.879743834 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8211237Z [rank3]:[W1204 13:27:23.288833804 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.8211411Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8211669Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8211831Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8212198Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8212403Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8212507Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8212605Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8212700Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8212703Z 2025-12-04T13:44:25.8212937Z [rank3]:[W1204 13:27:23.290095796 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8213108Z [rank2]:[W1204 13:27:23.298858642 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8213284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8213548Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8213713Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8214101Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8214302Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8214419Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8214515Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8214613Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8214615Z 2025-12-04T13:44:25.8214847Z [rank2]:[W1204 13:27:23.300940486 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8215018Z [rank1]:[W1204 13:27:23.879920332 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8215192Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8215447Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8215610Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8215975Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8216176Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8216281Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8216376Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8216472Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8216476Z 2025-12-04T13:44:25.8216711Z [rank1]:[W1204 13:27:23.882050125 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8216884Z [rank3]:[W1204 13:27:24.290264495 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8217057Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8217313Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8217519Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8217885Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8218111Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8218227Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8218322Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8218419Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8218421Z 2025-12-04T13:44:25.8218654Z [rank3]:[W1204 13:27:24.291494518 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8218831Z [rank2]:[W1204 13:27:24.301116835 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8219009Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8219266Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8219432Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8219799Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8220002Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8220107Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8220203Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8220299Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8220301Z 2025-12-04T13:44:25.8220532Z [rank2]:[W1204 13:27:24.302917255 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8220704Z [rank1]:[W1204 13:27:24.882239314 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8220879Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8221135Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8221313Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8221680Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8221892Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8222011Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8222107Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8222214Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8222216Z 2025-12-04T13:44:25.8222448Z [rank1]:[W1204 13:27:24.884373767 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8222618Z [rank3]:[W1204 13:27:25.291655798 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8222793Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8223050Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8223212Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8223582Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8223783Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8223887Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8223982Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8224079Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8224081Z 2025-12-04T13:44:25.8224314Z [rank3]:[W1204 13:27:25.292906070 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8224484Z [rank2]:[W1204 13:27:25.303069835 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8224658Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8224913Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8225076Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8225452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8225656Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8225771Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8225876Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8225973Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8225984Z 2025-12-04T13:44:25.8226217Z [rank2]:[W1204 13:27:25.305278616 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8226427Z [rank1]:W1204 13:27:25.543000 70937 
site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.8226599Z [rank1]:[W1204 13:27:25.884570696 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8226773Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8227029Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8227193Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8227591Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8227799Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8227905Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8228001Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8228100Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8228102Z 2025-12-04T13:44:25.8228337Z [rank1]:[W1204 13:27:25.886793647 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8228506Z [rank3]:[W1204 13:27:26.293070860 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, 
remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8228682Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8228939Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8229102Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8229480Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8229693Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8229809Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8229905Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8230014Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8230016Z 2025-12-04T13:44:25.8230252Z [rank3]:[W1204 13:27:26.294321082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8230423Z [rank2]:[W1204 13:27:26.305370968 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8230598Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8230853Z frame #0: 
c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8231016Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8231381Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8231583Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8231687Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8231784Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8231879Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8231883Z 2025-12-04T13:44:25.8232117Z [rank2]:[W1204 13:27:26.307351474 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8232287Z [rank1]:[W1204 13:27:26.886976297 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8232460Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8232715Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8232877Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8233252Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8233453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8233567Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8233673Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8233768Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8233780Z 2025-12-04T13:44:25.8234014Z [rank1]:[W1204 13:27:26.889138299 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8234184Z [rank3]:[W1204 13:27:27.294487823 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8234360Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8234618Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8234780Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8235148Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8235348Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8235454Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8235548Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8235645Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8235646Z 2025-12-04T13:44:25.8235878Z [rank3]:[W1204 13:27:27.295723705 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8236050Z [rank2]:[W1204 13:27:27.307453246 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8236224Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8236482Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8236647Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8237022Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8237223Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8237327Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8237438Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8237584Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8237588Z 2025-12-04T13:44:25.8237820Z [rank2]:[W1204 13:27:27.309396133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8238005Z [rank1]:[W1204 13:27:27.889321399 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8238178Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8238433Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8238596Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8238966Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8239169Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8239272Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8239368Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8239463Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8239465Z 2025-12-04T13:44:25.8239699Z [rank1]:[W1204 13:27:27.891543460 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8239870Z [rank3]:[W1204 13:27:28.295863947 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8240045Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8240301Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8240464Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8240838Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8241054Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8241159Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8241254Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8241362Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8241364Z 2025-12-04T13:44:25.8241605Z [rank3]:[W1204 13:27:28.297090270 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8241787Z [rank2]:[W1204 13:27:28.309515155 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8241964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8242218Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8242384Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8242752Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8242957Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8243061Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8243157Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8243254Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8243257Z 2025-12-04T13:44:25.8243490Z [rank2]:[W1204 13:27:28.311032252 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8243659Z [rank1]:[W1204 13:27:28.891722241 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8243833Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8244086Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8244249Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8244616Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8244820Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8244933Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8245029Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8245124Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8245136Z 2025-12-04T13:44:25.8245382Z [rank1]:[W1204 13:27:28.893953082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8245552Z [rank3]:[W1204 13:27:29.297263171 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8245737Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8245992Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8246154Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8246522Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8246723Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8246829Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8246923Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8247021Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8247023Z 2025-12-04T13:44:25.8247258Z [rank3]:[W1204 13:27:29.298503593 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8247432Z [rank2]:[W1204 13:27:29.311159944 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8247644Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8247901Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8248063Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8248430Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8248632Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8248737Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8248846Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8248944Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8248945Z 2025-12-04T13:44:25.8249178Z [rank2]:[W1204 13:27:29.313100721 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8249376Z [rank1]:[W1204 13:27:29.894102894 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8249550Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8249821Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8249982Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8250349Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8250553Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8250657Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8250752Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8250848Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8250850Z 2025-12-04T13:44:25.8251082Z [rank1]:[W1204 13:27:29.896116789 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8251252Z [rank3]:[W1204 13:27:30.298631676 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8251427Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8251686Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8251851Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8252217Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8252419Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8252523Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8252618Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8252730Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8252731Z 2025-12-04T13:44:25.8252965Z [rank3]:[W1204 13:27:30.299823780 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8253147Z [rank2]:[W1204 13:27:30.313281462 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8253330Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8253585Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8253761Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8254130Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8254333Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8254438Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8254535Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8254632Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8254633Z 2025-12-04T13:44:25.8254867Z [rank2]:[W1204 13:27:30.314952435 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8255038Z [rank1]:[W1204 13:27:30.896293221 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8255212Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8255466Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8255629Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8255996Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8256199Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8256304Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8256400Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8256497Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8256498Z 2025-12-04T13:44:25.8256741Z [rank1]:[W1204 13:27:30.898473173 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8256910Z [rank3]:[W1204 13:27:31.300023581 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8257094Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8257361Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8257575Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8257940Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8258141Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8258246Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8258341Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8258438Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8258440Z 2025-12-04T13:44:25.8258673Z [rank3]:[W1204 13:27:31.301839361 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8258843Z [rank2]:[W1204 13:27:31.315095038 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8259019Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8259275Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8259439Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8259807Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8260008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8260112Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8260209Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8260306Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8260309Z 2025-12-04T13:44:25.8260557Z [rank2]:[W1204 13:27:31.317063375 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8260730Z [rank1]:[W1204 13:27:31.898678074 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8260903Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8261183Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8261344Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8261723Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8261924Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8262028Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8262124Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8262220Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8262223Z 2025-12-04T13:44:25.8262457Z [rank1]:[W1204 13:27:31.901183459 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8262628Z [rank3]:[W1204 13:27:32.302030183 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8262806Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8263063Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8263225Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8263593Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8263793Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8263900Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8263995Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8264091Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8264093Z 2025-12-04T13:44:25.8264327Z [rank3]:[W1204 13:27:32.304144256 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8264508Z [rank2]:[W1204 13:27:32.317172468 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8264683Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8264949Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8265122Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8265491Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8265708Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8265813Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8265909Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8266006Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8266008Z 2025-12-04T13:44:25.8266241Z [rank2]:[W1204 13:27:32.318868781 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8266413Z [rank1]:[W1204 13:27:32.901355252 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8266587Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8266845Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8267009Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8267376Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8267615Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8267719Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8267815Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8267910Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8267912Z 2025-12-04T13:44:25.8268146Z [rank1]:[W1204 13:27:32.903624952 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8268317Z [rank3]:[W1204 13:27:33.304315609 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8268504Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8268760Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8268936Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8269316Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8269529Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8269634Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8269728Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8269826Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8269828Z 2025-12-04T13:44:25.8270065Z [rank3]:[W1204 13:27:33.306367194 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8270236Z [rank2]:[W1204 13:27:33.319011155 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8270412Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8270666Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8270829Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8271195Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8271402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8271508Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8271603Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8271700Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8271702Z 2025-12-04T13:44:25.8271934Z [rank2]:[W1204 13:27:33.321211486 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8272105Z [rank1]:[W1204 13:27:33.903801135 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8272280Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8272543Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8272705Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8273091Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8273303Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8273408Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8273504Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8273600Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8273603Z 2025-12-04T13:44:25.8273838Z [rank1]:[W1204 13:27:33.905853799 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8274008Z [rank3]:[W1204 13:27:34.306533307 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8274183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8274440Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8274600Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8274966Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8275166Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8275272Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8275367Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8275463Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8275465Z 2025-12-04T13:44:25.8275701Z [rank3]:[W1204 13:27:34.308757978 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8275873Z [rank2]:[W1204 13:27:34.321351070 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8276048Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8276314Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8276477Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8276860Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8277062Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8277179Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8277275Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8277371Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8277373Z 2025-12-04T13:44:25.8277639Z [rank2]:[W1204 13:27:34.324609568 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8277811Z [rank1]:[W1204 13:27:34.906038393 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8277988Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8278252Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8278415Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8278785Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8278991Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8279096Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8279192Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8279290Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8279292Z 2025-12-04T13:44:25.8279526Z [rank1]:[W1204 13:27:34.908331562 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8279697Z [rank3]:[W1204 13:27:35.308922502 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8279873Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8280132Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8280310Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8280677Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8280910Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8281017Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8281125Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8281222Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8281224Z 2025-12-04T13:44:25.8281457Z [rank3]:[W1204 13:27:35.310840070 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8281627Z [rank2]:[W1204 13:27:35.324776832 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8281802Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8282056Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8282221Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8282594Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8282798Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8282903Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8282998Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8283095Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8283096Z 2025-12-04T13:44:25.8283329Z [rank2]:[W1204 13:27:35.326898585 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8283499Z [rank1]:[W1204 13:27:35.908513406 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8283673Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8283930Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8284094Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8284474Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8284690Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8284805Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8284902Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8285008Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8285010Z 2025-12-04T13:44:25.8285244Z [rank1]:[W1204 13:27:35.909870696 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8285412Z [rank3]:[W1204 13:27:36.311022674 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8285588Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8285844Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8286007Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8286373Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8286573Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8286679Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8286775Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8286873Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8286875Z 2025-12-04T13:44:25.8287108Z [rank3]:[W1204 13:27:36.312267416 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8287279Z [rank2]:[W1204 13:27:36.327016891 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8287455Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8287746Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8287909Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8288288Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8288492Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8288612Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8288721Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8288820Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8288835Z 2025-12-04T13:44:25.8289072Z [rank2]:[W1204 13:27:36.328805341 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8289242Z [rank1]:[W1204 13:27:36.910153628 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8289415Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8289672Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8289834Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8290200Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8290402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8290506Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8290603Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8290699Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8290701Z 2025-12-04T13:44:25.8290935Z [rank1]:[W1204 13:27:36.912405558 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8291106Z [rank3]:[W1204 13:27:37.312430331 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8291281Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8291538Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8291700Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8292077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8292277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8292392Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8292487Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8292594Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8292596Z 2025-12-04T13:44:25.8292829Z [rank3]:[W1204 13:27:37.313664754 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8293012Z [rank2]:[W1204 13:27:37.328954596 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8293189Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8293445Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8293611Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8293978Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8294181Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8294286Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8294383Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8294480Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8294482Z 2025-12-04T13:44:25.8294715Z [rank2]:[W1204 13:27:37.331200417 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8294886Z [rank1]:[W1204 13:27:37.912561034 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8295059Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8295313Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8295481Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8295848Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8296060Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8296164Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8296277Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8296372Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8296374Z 2025-12-04T13:44:25.8296619Z [rank1]:[W1204 13:27:37.914425222 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8296838Z [rank3]:W1204 13:27:38.144000 70939 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.8297008Z [rank3]:[W1204 13:27:38.313832869 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8297182Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8297437Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8297640Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8298010Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8298211Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8298317Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8298412Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8298509Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8298511Z 2025-12-04T13:44:25.8298742Z [rank3]:[W1204 13:27:38.315126261 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8298914Z [rank2]:[W1204 13:27:38.331562837 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8299087Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8299342Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8299507Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8299876Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8300093Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8300197Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8300306Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8300404Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8300418Z 2025-12-04T13:44:25.8300653Z [rank2]:[W1204 13:27:38.334775696 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8300836Z [rank1]:[W1204 13:27:38.914592098 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8301011Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8301269Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8301433Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8301804Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8302008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8302113Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8302208Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8302306Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8302307Z 2025-12-04T13:44:25.8302542Z [rank1]:[W1204 13:27:38.916810609 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8302713Z [rank3]:[W1204 13:27:39.315297636 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8302888Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8303145Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8303309Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8303675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8303887Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8303993Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8304090Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8304196Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8304198Z 2025-12-04T13:44:25.8304440Z [rank3]:[W1204 13:27:39.316544169 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8304611Z [rank2]:[W1204 13:27:39.334907933 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8304798Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8305054Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8305218Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8305585Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8305787Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8305892Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8305988Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8306087Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8306090Z 2025-12-04T13:44:25.8306327Z [rank2]:[W1204 13:27:39.336689804 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8306497Z [rank1]:[W1204 13:27:39.916978485 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8306674Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8306930Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8307093Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8307459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8307704Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8307821Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8307916Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8308014Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8308029Z 2025-12-04T13:44:25.8308265Z [rank1]:[W1204 13:27:39.919175656 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8308448Z [rank3]:[W1204 13:27:40.316710815 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8308637Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8308893Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8309055Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8309424Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8309625Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8309733Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8309828Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8309926Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8309928Z 2025-12-04T13:44:25.8310163Z [rank3]:[W1204 13:27:40.318104524 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8310337Z [rank2]:[W1204 13:27:40.336826350 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8310510Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8310767Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8310929Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8311297Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8311501Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8311605Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8311717Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8311814Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8311815Z 2025-12-04T13:44:25.8312049Z [rank2]:[W1204 13:27:40.338799027 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8312239Z [rank1]:[W1204 13:27:40.919355382 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8312417Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8312686Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8312848Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8313213Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8313415Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8313521Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8313617Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8313714Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8313716Z 2025-12-04T13:44:25.8313949Z [rank1]:[W1204 13:27:40.921824308 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8314119Z [rank3]:[W1204 13:27:41.318267441 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8314294Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8314550Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8314717Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8315083Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8315286Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8315391Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8315487Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8315592Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8315594Z 2025-12-04T13:44:25.8315826Z [rank3]:[W1204 13:27:41.319505503 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8316008Z [rank2]:[W1204 13:27:41.338951154 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8316191Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8316449Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8316624Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8316993Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8317197Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8317300Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8317397Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8317522Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8317524Z 2025-12-04T13:44:25.8317761Z [rank2]:[W1204 13:27:41.340732234 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8317931Z [rank1]:[W1204 13:27:41.922012814 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8318109Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8318365Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8318527Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8318900Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8319106Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8319216Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8319310Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8319408Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8319409Z 2025-12-04T13:44:25.8319660Z [rank1]:[W1204 13:27:41.924468420 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8319831Z [rank3]:[W1204 13:27:42.319694620 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8320017Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8320285Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8320461Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8320829Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8321031Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8321140Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8321235Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8321333Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8321336Z 2025-12-04T13:44:25.8321570Z [rank3]:[W1204 13:27:42.321824533 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8321741Z [rank2]:[W1204 13:27:42.340878872 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8321914Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8322172Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8322336Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8322704Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8322907Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8323011Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8323108Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8323204Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8323208Z 2025-12-04T13:44:25.8323455Z [rank2]:[W1204 13:27:42.342713321 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8323626Z [rank1]:[W1204 13:27:42.924638207 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8323799Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8324077Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8324240Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8324626Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8324826Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8324932Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8325027Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8325126Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8325128Z 2025-12-04T13:44:25.8325363Z [rank1]:[W1204 13:27:42.926955596 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8325535Z [rank3]:[W1204 13:27:43.322006340 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8325710Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8325966Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8326132Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8326502Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8326707Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8326812Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8326907Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8327003Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8327005Z 2025-12-04T13:44:25.8327239Z [rank3]:[W1204 13:27:43.323893728 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8327421Z [rank2]:[W1204 13:27:43.342843579 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8327637Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8327895Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8328082Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8328448Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8328666Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8328770Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8328870Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8328966Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8328968Z 2025-12-04T13:44:25.8329205Z [rank2]:[W1204 13:27:43.344956833 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8329378Z [rank1]:[W1204 13:27:43.927127243 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8329553Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8329809Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8329972Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8330341Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8330542Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8330647Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8330743Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8330839Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8330841Z 2025-12-04T13:44:25.8331075Z [rank1]:[W1204 13:27:43.929308945 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8331282Z [rank0]:W1204 13:27:44.211000 70936 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.8331502Z [rank2]:W1204 13:27:44.252000 70938 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.8331675Z [rank3]:[W1204 13:27:44.324049296 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8331862Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8332127Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8332303Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8332669Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8332869Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8332976Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8333070Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8333167Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8333170Z 2025-12-04T13:44:25.8333403Z [rank3]:[W1204 13:27:44.325933564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8333574Z [rank2]:[W1204 13:27:44.345056382 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8333755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8334014Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8334179Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8334544Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8334747Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8334851Z frame #4: + 0xdc253 (0x78d42d848253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8334948Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8335045Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8335048Z 2025-12-04T13:44:25.8335282Z [rank2]:[W1204 13:27:44.347354361 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8335464Z [rank1]:[W1204 13:27:44.929459544 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8335637Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8335915Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8336079Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8336459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8336661Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8336766Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8336862Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8336958Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.8336960Z 2025-12-04T13:44:25.8337194Z [rank1]:[W1204 13:27:44.930746205 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8337364Z [rank3]:[W1204 13:27:45.326112102 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8337572Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8337828Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8337992Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8338362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8338563Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8338670Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8338764Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8338861Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8338863Z 2025-12-04T13:44:25.8339096Z [rank3]:[W1204 13:27:45.327981191 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:25.8339280Z [rank2]:[W1204 13:27:45.347493820 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8339456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8339709Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8339904Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8340270Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8340489Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8340592Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8340689Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8340786Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8340788Z 2025-12-04T13:44:25.8341021Z [rank2]:[W1204 13:27:45.349240811 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8341197Z [rank1]:[W1204 13:27:45.930920393 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8341372Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8341627Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8341790Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8342158Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8342361Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8342467Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8342564Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8342661Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8342663Z 2025-12-04T13:44:25.8342896Z [rank1]:[W1204 13:27:45.933155374 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8343065Z [rank3]:[W1204 13:27:46.328136570 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8343249Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8343506Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8343679Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8344054Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8344267Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8344374Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8344468Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8344565Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8344567Z 2025-12-04T13:44:25.8344806Z [rank3]:[W1204 13:27:46.329590608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8344977Z [rank2]:[W1204 13:27:46.349419740 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8345154Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8345407Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8345571Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.8345937Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8346140Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8346244Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8350080Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8350184Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8350189Z 2025-12-04T13:44:25.8350428Z [rank2]:[W1204 13:27:46.351737189 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8350601Z [rank1]:[W1204 13:27:46.933278974 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8350780Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8351060Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8351223Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8351623Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.8351842Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8351948Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8352044Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8352141Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8352144Z 2025-12-04T13:44:25.8352379Z [rank1]:[W1204 13:27:46.935410587 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8352551Z [rank3]:[W1204 13:27:47.329799846 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8352728Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8352989Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8353153Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8353527Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8353726Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8353832Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8353928Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8354025Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8354027Z 2025-12-04T13:44:25.8354261Z [rank3]:[W1204 13:27:47.331634886 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8354435Z [rank2]:[W1204 13:27:47.351873339 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8354612Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8354881Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8355056Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8355431Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8355643Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8355759Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8355856Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8355953Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8355955Z 2025-12-04T13:44:25.8356188Z [rank2]:[W1204 13:27:47.353933273 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8356360Z [rank1]:[W1204 13:27:47.935558007 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8356535Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8356795Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8356957Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8357323Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8357569Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8357674Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8357769Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8357865Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8357867Z 2025-12-04T13:44:25.8358099Z [rank1]:[W1204 13:27:47.937070523 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8358269Z [rank3]:[W1204 13:27:48.331806116 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8358445Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8358702Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8358883Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8359249Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8359481Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8359587Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8359694Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8359792Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8359795Z 2025-12-04T13:44:25.8360029Z [rank3]:[W1204 13:27:48.333622596 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8360200Z [rank2]:[W1204 13:27:48.354098174 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.8360375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8360630Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8360795Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8361164Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8361368Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8361474Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8361569Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8361668Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8361670Z 2025-12-04T13:44:25.8361903Z [rank2]:[W1204 13:27:48.356101740 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8362074Z [rank1]:[W1204 13:27:49.937167566 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8362248Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8362503Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8362665Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8363041Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8363254Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8363368Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8363465Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8363570Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8363572Z 2025-12-04T13:44:25.8363806Z [rank1]:[W1204 13:27:49.938434898 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8363977Z [rank3]:[W1204 13:27:49.333986542 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8364153Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8364411Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8364575Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8364942Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8365144Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8365248Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8365345Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8365441Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8365444Z 2025-12-04T13:44:25.8365679Z [rank3]:[W1204 13:27:49.335247764 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8365847Z [rank2]:[W1204 13:27:49.356242670 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8366022Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8366278Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8366443Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8366818Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8367022Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8367138Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8367242Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8367339Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8367351Z 2025-12-04T13:44:25.8367629Z [rank2]:[W1204 13:27:49.358341584 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8367801Z [rank1]:[W1204 13:27:50.938532009 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8367975Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8368232Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8368394Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8368764Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8368966Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8369071Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8369166Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8369262Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8369264Z 2025-12-04T13:44:25.8369497Z [rank1]:[W1204 13:27:50.939837171 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8369670Z [rank3]:[W1204 13:27:50.335432994 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8369848Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8370107Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8370270Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8370659Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8370860Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8370965Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8371073Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8371182Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8371184Z 2025-12-04T13:44:25.8371418Z [rank3]:[W1204 13:27:50.336673676 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8371602Z [rank2]:[W1204 13:27:50.358519164 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8371778Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8372037Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8372203Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8372569Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8372773Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8372877Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8372974Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8373070Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8373072Z 2025-12-04T13:44:25.8373305Z [rank2]:[W1204 13:27:50.360474181 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8373478Z [rank1]:[W1204 13:27:51.939946823 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8373652Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8373910Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8374074Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8374445Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8374667Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8374770Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8374866Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8374971Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8374972Z 2025-12-04T13:44:25.8375215Z [rank1]:[W1204 13:27:51.941289333 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8375397Z [rank3]:[W1204 13:27:51.336858307 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8375574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8375830Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8375993Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8376362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8376566Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8376670Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8376764Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8376861Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8376863Z 2025-12-04T13:44:25.8377097Z [rank3]:[W1204 13:27:51.338098639 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8377266Z [rank2]:[W1204 13:27:51.360625192 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8377442Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8377735Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8377900Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8378270Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8378476Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8378593Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8378690Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8378787Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8378801Z 2025-12-04T13:44:25.8379046Z [rank2]:[W1204 13:27:51.361892154 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8379216Z [rank1]:[W1204 13:27:52.941464454 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8379404Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8379660Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8379821Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8380187Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8380390Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8380495Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8380591Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8380687Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8380690Z 2025-12-04T13:44:25.8380924Z [rank1]:[W1204 13:27:52.943377002 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8381094Z [rank3]:[W1204 13:27:52.338310440 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8381269Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8381528Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8381689Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8382059Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8382259Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8382371Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8382476Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8382573Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8382575Z 2025-12-04T13:44:25.8382809Z [rank3]:[W1204 13:27:52.340517231 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8383007Z [rank2]:[W1204 13:27:52.362075945 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8383183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8383450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8383613Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8383982Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8384185Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8384291Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8384385Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8384483Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8384485Z 2025-12-04T13:44:25.8384717Z [rank2]:[W1204 13:27:52.364358504 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8384891Z [rank1]:[W1204 13:27:53.943495234 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8385067Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8385324Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8385487Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8385852Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8386054Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8386158Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8386254Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8386363Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8386365Z 2025-12-04T13:44:25.8386599Z [rank1]:[W1204 13:27:53.945090669 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8386779Z [rank3]:[W1204 13:27:53.340726402 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8386963Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8387224Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8387405Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8387804Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8388007Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8388112Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8388207Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8388304Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8388307Z 2025-12-04T13:44:25.8388540Z [rank3]:[W1204 13:27:53.343655017 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8388710Z [rank2]:[W1204 13:27:53.364511086 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8388887Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8389141Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8389306Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8389675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8389878Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8389984Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8390078Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8390176Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8390178Z 2025-12-04T13:44:25.8390424Z [rank2]:[W1204 13:27:53.366474133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8390595Z [rank1]:[W1204 13:27:54.945242311 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8390781Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8391052Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8391228Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8391597Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8391800Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8391905Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8392003Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8392099Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8392101Z 2025-12-04T13:44:25.8392337Z [rank1]:[W1204 13:27:54.947104120 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8392507Z [rank3]:[W1204 13:27:54.343871308 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8392681Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8392937Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8393099Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8393466Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8393667Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8393773Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8393868Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8393964Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8393967Z 2025-12-04T13:44:25.8394210Z [rank3]:[W1204 13:27:54.346186927 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8394381Z [rank2]:[W1204 13:27:54.366624315 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8394557Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8394830Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8394996Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8395372Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8395576Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8395681Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8395777Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8395875Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8395878Z 2025-12-04T13:44:25.8396113Z [rank2]:[W1204 13:27:54.368257329 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8396284Z [rank1]:[W1204 13:27:55.947511557 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8396458Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8396715Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8396878Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8397247Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8397450Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8397591Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8397688Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8397784Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8397785Z 2025-12-04T13:44:25.8398020Z [rank1]:[W1204 13:27:55.949921054 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8398207Z [rank3]:[W1204 13:27:55.346401918 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8398382Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8398655Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8398829Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8399196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8399409Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8399515Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8399613Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8399710Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8399712Z 2025-12-04T13:44:25.8399945Z [rank3]:[W1204 13:27:55.348265787 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8400115Z [rank2]:[W1204 13:27:55.368396322 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8400291Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8400548Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8400713Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8401079Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8401283Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8401388Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8401483Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8401580Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8401582Z 2025-12-04T13:44:25.8401814Z [rank2]:[W1204 13:27:55.370619753 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8401984Z [rank1]:[W1204 13:27:56.950077007 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8402168Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8402424Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8402597Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8402977Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8403190Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8403293Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8403389Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8403485Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8403487Z 2025-12-04T13:44:25.8403722Z [rank1]:[W1204 13:27:56.951928686 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8403893Z [rank3]:[W1204 13:27:56.348479319 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8404068Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8404322Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8404485Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8404859Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8405061Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8405166Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8405263Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8405359Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8405361Z 2025-12-04T13:44:25.8405594Z [rank3]:[W1204 13:27:56.350866666 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8405763Z [rank2]:[W1204 13:27:56.371164368 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8405939Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8406203Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8406376Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8406754Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8406965Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8407071Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8407167Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8407264Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8407267Z 2025-12-04T13:44:25.8407535Z [rank2]:[W1204 13:27:56.373424468 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8407705Z [rank1]:[W1204 13:27:57.952088439 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8407880Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8408137Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8408299Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8408666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8408868Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8408972Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8409068Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8409165Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8409167Z 2025-12-04T13:44:25.8409402Z [rank1]:[W1204 13:27:57.954108175 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8409572Z [rank3]:[W1204 13:27:57.351079988 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8409746Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8410021Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8410183Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8410576Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8410778Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8410896Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8410993Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8411088Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8411090Z 2025-12-04T13:44:25.8411325Z [rank3]:[W1204 13:27:57.353174712 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8411497Z [rank2]:[W1204 13:27:57.373567702 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8411672Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8411928Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8412094Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8412464Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8412666Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8412772Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8412867Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8412964Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8412966Z 2025-12-04T13:44:25.8413198Z [rank2]:[W1204 13:27:57.375680875 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8413371Z [rank1]:[W1204 13:27:58.954244079 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8413547Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8413803Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8413975Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8414339Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8414560Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8414665Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8414772Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8414868Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8414870Z 2025-12-04T13:44:25.8415103Z [rank1]:[W1204 13:27:58.956957499 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8415274Z [rank3]:[W1204 13:27:58.353384035 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8415448Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8415707Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8415873Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8416242Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8416445Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8416551Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8416648Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8416743Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8416745Z 2025-12-04T13:44:25.8416981Z [rank3]:[W1204 13:27:58.355760002 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8417151Z [rank2]:[W1204 13:27:58.375814970 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8417327Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8417615Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8417784Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8418166Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8418386Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8418504Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8418600Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8418711Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8418713Z 2025-12-04T13:44:25.8418946Z [rank2]:[W1204 13:27:58.377038612 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8419115Z [rank1]:[W1204 13:27:59.957096143 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8419290Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8419545Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8419709Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8420077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8420280Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8420384Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8420480Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8420576Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8420580Z 2025-12-04T13:44:25.8420812Z [rank1]:[W1204 13:27:59.959151728 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8420984Z [rank3]:[W1204 13:27:59.355961986 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8421159Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8421417Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8421581Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8421958Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8422161Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8422280Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8422386Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8422482Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8422494Z 2025-12-04T13:44:25.8422731Z [rank3]:[W1204 13:27:59.357991031 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8422900Z [rank2]:[W1204 13:27:59.377175947 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8423074Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8423331Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8423494Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8423863Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8424064Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8424171Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8424266Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8424363Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8424366Z 2025-12-04T13:44:25.8424600Z [rank2]:[W1204 13:27:59.379304270 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8424774Z [rank1]:[W1204 13:28:00.959323002 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8424948Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8425203Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8425367Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8425743Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8425944Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8426058Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8426154Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8426258Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8426262Z 2025-12-04T13:44:25.8426496Z [rank1]:[W1204 13:28:00.961810327 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8426678Z [rank3]:[W1204 13:28:00.358178105 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8426852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8427108Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8427271Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8427666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8427871Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8427974Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8428070Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8428166Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8428168Z 2025-12-04T13:44:25.8428401Z [rank3]:[W1204 13:28:00.359413558 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8428570Z [rank2]:[W1204 13:28:00.379464695 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8428746Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8429002Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8429167Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8429536Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8429755Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8429859Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8429966Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8430064Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8430066Z 2025-12-04T13:44:25.8430314Z [rank2]:[W1204 13:28:00.381112329 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8430498Z [rank1]:[W1204 13:28:01.961955583 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8430673Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8430927Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8431093Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8431459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8431662Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8431764Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8431861Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8431957Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8431960Z 2025-12-04T13:44:25.8432192Z [rank1]:[W1204 13:28:01.964074296 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8432362Z [rank3]:[W1204 13:28:01.359627102 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8432536Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8432792Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8432954Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8433329Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8433546Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8433650Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8433746Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8433852Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8433854Z 2025-12-04T13:44:25.8434098Z [rank3]:[W1204 13:28:01.361814993 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8434268Z [rank2]:[W1204 13:28:01.381299163 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8434452Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8434706Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8434869Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8435238Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8435444Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8435551Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8435645Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8435742Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8435745Z 2025-12-04T13:44:25.8435978Z [rank2]:[W1204 13:28:01.383413877 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8436151Z [rank1]:[W1204 13:28:02.964231631 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8436326Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8436580Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8436743Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8437112Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8437314Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8437418Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8437558Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8437654Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8437657Z 2025-12-04T13:44:25.8437902Z [rank1]:[W1204 13:28:02.965511913 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8438084Z [rank3]:[W1204 13:28:02.362006408 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8438259Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8438528Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8438692Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8439060Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8439262Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8439367Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8439464Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8439560Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8439561Z 2025-12-04T13:44:25.8439798Z [rank3]:[W1204 13:28:02.363710591 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8439969Z [rank2]:[W1204 13:28:02.383567772 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8440144Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8440401Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8440566Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8440934Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8441138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8441244Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8441339Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8441448Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8441450Z 2025-12-04T13:44:25.8441683Z [rank2]:[W1204 13:28:02.385723085 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8441866Z [rank1]:[W1204 13:28:03.965674559 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8442054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8442327Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8442492Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8442859Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8443065Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8443168Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8443265Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8443363Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8443365Z 2025-12-04T13:44:25.8443597Z [rank1]:[W1204 13:28:03.967894130 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8443767Z [rank3]:[W1204 13:28:03.363897686 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8443941Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8444200Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8444364Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8444731Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8444934Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8445037Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8445134Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8445230Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8445232Z 2025-12-04T13:44:25.8445479Z [rank3]:[W1204 13:28:03.365836953 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8445649Z [rank2]:[W1204 13:28:03.385883270 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8445845Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8446100Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8446284Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8446652Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8446854Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8446959Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8447054Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8447153Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8447155Z 2025-12-04T13:44:25.8447389Z [rank2]:[W1204 13:28:03.388103362 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8447598Z [rank1]:[W1204 13:28:04.967994477 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8447774Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8448031Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8448195Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8448564Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8448768Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8448873Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8448969Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8449066Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8449067Z 2025-12-04T13:44:25.8449315Z [rank1]:[W1204 13:28:04.970448963 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8449486Z [rank3]:[W1204 13:28:04.366034729 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8449672Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8449940Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8450103Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8450484Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8450687Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8450792Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8450889Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8450985Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8450987Z 2025-12-04T13:44:25.8451223Z [rank3]:[W1204 13:28:04.367814249 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8451392Z [rank2]:[W1204 13:28:04.388223319 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8451567Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8451823Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8451986Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8452358Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8452560Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8452665Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8452762Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8452860Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8452862Z 2025-12-04T13:44:25.8453096Z [rank2]:[W1204 13:28:04.390318272 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8453277Z [rank1]:[W1204 13:28:05.970592710 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8453452Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8453716Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8453888Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8454265Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8454466Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8454572Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8454667Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8454764Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8454766Z 2025-12-04T13:44:25.8455002Z [rank1]:[W1204 13:28:05.972936198 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8455175Z [rank3]:[W1204 13:28:05.367977906 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8455349Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8455605Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8455769Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8456137Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8456339Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8456441Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8456538Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8456633Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8456636Z 2025-12-04T13:44:25.8456870Z [rank3]:[W1204 13:28:05.369200249 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8457050Z [rank2]:[W1204 13:28:05.390489779 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8457226Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8457507Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8457697Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8458066Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8458281Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8458385Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8458481Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8458577Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8458579Z 2025-12-04T13:44:25.8458812Z [rank2]:[W1204 13:28:05.392835577 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8458981Z [rank1]:[W1204 13:28:06.973093556 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8459157Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8459414Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8459579Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8459943Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8460145Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8460250Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8460345Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8460442Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8460444Z 2025-12-04T13:44:25.8460677Z [rank1]:[W1204 13:28:06.974420356 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8460847Z [rank3]:[W1204 13:28:06.369390946 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8461035Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8461290Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8461464Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8461848Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8462059Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8462163Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8462259Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8462355Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8462357Z 2025-12-04T13:44:25.8462592Z [rank3]:[W1204 13:28:06.371106918 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8462764Z [rank2]:[W1204 13:28:06.392963415 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8462940Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8463198Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8463361Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8463732Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8463935Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8464040Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8464138Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8464235Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8464238Z 2025-12-04T13:44:25.8464470Z [rank2]:[W1204 13:28:06.395188916 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8464639Z [rank1]:[W1204 13:28:07.974583384 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8464814Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8465078Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8465241Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8465625Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8465827Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8465944Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8466040Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8466137Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8466138Z 2025-12-04T13:44:25.8466372Z [rank1]:[W1204 13:28:07.976488301 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8466542Z [rank3]:[W1204 13:28:07.371698576 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8466717Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8466977Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8467139Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8467535Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8467737Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8467841Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8467938Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8468034Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8468036Z 2025-12-04T13:44:25.8468270Z [rank3]:[W1204 13:28:07.373867248 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8468440Z [rank2]:[W1204 13:28:07.395316544 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8468614Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8468885Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8469046Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8469413Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8469641Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8469759Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8469855Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8469953Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8469955Z 2025-12-04T13:44:25.8470191Z [rank2]:[W1204 13:28:07.397278361 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8470362Z [rank1]:[W1204 13:28:08.976599000 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8470537Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8470792Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8470956Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8471321Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8471524Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8471628Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8471724Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8471821Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8471823Z 2025-12-04T13:44:25.8472056Z [rank1]:[W1204 13:28:08.978022669 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8472227Z [rank3]:[W1204 13:28:08.374053115 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8472402Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8472662Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8472836Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8473202Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8473423Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8473527Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8473634Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8473731Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8473733Z 2025-12-04T13:44:25.8473966Z [rank3]:[W1204 13:28:08.375334227 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8474135Z [rank2]:[W1204 13:28:08.397436489 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8474310Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8474567Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8474731Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8475101Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8475303Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8475409Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8475505Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8475603Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8475605Z 2025-12-04T13:44:25.8475839Z [rank2]:[W1204 13:28:08.399186431 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8476009Z [rank1]:[W1204 13:28:09.978350644 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8476185Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8476442Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8476607Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8476987Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8477205Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8477320Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8477415Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8477561Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8477563Z 2025-12-04T13:44:25.8477800Z [rank1]:[W1204 13:28:09.980696982 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8477969Z [rank3]:[W1204 13:28:09.375525455 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8478144Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8478399Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8478561Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8478927Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8479130Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8479235Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8479333Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8479429Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8479432Z 2025-12-04T13:44:25.8479667Z [rank3]:[W1204 13:28:09.377270016 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8479836Z [rank2]:[W1204 13:28:09.399330799 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8480012Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8480269Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8480432Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8480815Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8481016Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8481135Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8481232Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8481341Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8481343Z 2025-12-04T13:44:25.8481590Z [rank2]:[W1204 13:28:09.401364084 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8481759Z [rank1]:[W1204 13:28:10.980890380 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8481935Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8482192Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8482356Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8482724Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8482927Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8483034Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8483129Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8483226Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8483228Z 2025-12-04T13:44:25.8483462Z [rank1]:[W1204 13:28:10.982439355 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8483635Z [rank3]:[W1204 13:28:10.377450715 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8483808Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8484064Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8484227Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8484591Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8484804Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8484909Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8485015Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8485111Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8485123Z 2025-12-04T13:44:25.8485356Z [rank3]:[W1204 13:28:10.379530369 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8485537Z [rank2]:[W1204 13:28:10.401505004 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8485714Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8485972Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8486136Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8486502Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8486705Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8486811Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8486907Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8487005Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8487006Z 2025-12-04T13:44:25.8487241Z [rank2]:[W1204 13:28:10.403789383 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8487412Z [rank1]:[W1204 13:28:11.982619964 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8487628Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8487886Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8488051Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8488414Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8488629Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8488734Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8488828Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8488936Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8488938Z 2025-12-04T13:44:25.8489185Z [rank1]:[W1204 13:28:11.984916133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8489355Z [rank3]:[W1204 13:28:11.379698468 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8489542Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8489797Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8489961Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8490330Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8490534Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8490638Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8490733Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8490828Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8490831Z 2025-12-04T13:44:25.8491064Z [rank3]:[W1204 13:28:11.381598216 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8491236Z [rank2]:[W1204 13:28:11.403952532 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8491411Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8491667Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8491829Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8492198Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8492401Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8492527Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8492622Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8492718Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8492730Z 2025-12-04T13:44:25.8492964Z [rank2]:[W1204 13:28:11.406022537 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8493144Z [rank1]:[W1204 13:28:12.985093653 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8493331Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8493585Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8493748Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8494114Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8494316Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8494423Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8494520Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8494617Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8494619Z 2025-12-04T13:44:25.8494852Z [rank1]:[W1204 13:28:12.987183236 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8495023Z [rank3]:[W1204 13:28:12.381802125 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8495196Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8495454Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8495617Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8495982Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8496187Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8496291Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8496397Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8496494Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8496495Z 2025-12-04T13:44:25.8496731Z [rank3]:[W1204 13:28:12.384388207 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8496922Z [rank2]:[W1204 13:28:12.406164667 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8497097Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8497363Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8497565Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8497933Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8498135Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8498241Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8498337Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8498433Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8498435Z 2025-12-04T13:44:25.8498670Z [rank2]:[W1204 13:28:12.407936858 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8498843Z [rank1]:[W1204 13:28:13.987349156 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8499020Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8499274Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8499438Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8499803Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8500006Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8500112Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8500208Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8500318Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8500319Z 2025-12-04T13:44:25.8500552Z [rank1]:[W1204 13:28:13.989836201 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8500736Z [rank3]:[W1204 13:28:13.384577657 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8500924Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8501184Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8501361Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8501726Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8501929Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8502033Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8502129Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8502224Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8502226Z 2025-12-04T13:44:25.8502459Z [rank3]:[W1204 13:28:13.386290559 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8502631Z [rank2]:[W1204 13:28:13.408088638 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8502804Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8503061Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8503226Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8503596Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8503798Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8503903Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8504000Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8504096Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8504098Z 2025-12-04T13:44:25.8504340Z [rank2]:[W1204 13:28:13.410257880 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8504509Z [rank1]:[W1204 13:28:14.990012531 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8504694Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8504956Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8505135Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8505506Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8505709Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8505816Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8505913Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8506009Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8506012Z 2025-12-04T13:44:25.8506244Z [rank1]:[W1204 13:28:14.992442478 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8506414Z [rank3]:[W1204 13:28:14.386472169 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8506588Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8506845Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8507008Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8507378Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8507618Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8507723Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8507818Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8507915Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8507918Z 2025-12-04T13:44:25.8508169Z [rank3]:[W1204 13:28:14.388408526 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8508338Z [rank2]:[W1204 13:28:14.410372321 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8508513Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8508793Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8508955Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8509335Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8509537Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8509643Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8509740Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8509838Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8509840Z 2025-12-04T13:44:25.8510075Z [rank2]:[W1204 13:28:14.412385267 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8510244Z [rank1]:[W1204 13:28:15.992555559 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8510420Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8510675Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8510839Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8511206Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8511406Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8511512Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8511606Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8511702Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8511704Z 2025-12-04T13:44:25.8511938Z [rank1]:[W1204 13:28:15.994591884 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8512119Z [rank3]:[W1204 13:28:15.388609116 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8512293Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8512549Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8512734Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8513099Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8513311Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8513415Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8513511Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8513606Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8513609Z 2025-12-04T13:44:25.8513844Z [rank3]:[W1204 13:28:15.390522694 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8514019Z [rank2]:[W1204 13:28:15.412526858 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8514195Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8514450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8514614Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8514983Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8515186Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8515290Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8515386Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8515483Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8515485Z 2025-12-04T13:44:25.8515718Z [rank2]:[W1204 13:28:15.415043983 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8515887Z [rank1]:[W1204 13:28:16.994712636 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8516072Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8516332Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8516507Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8516882Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8517093Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8517198Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8517292Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8517390Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8517391Z 2025-12-04T13:44:25.8517654Z [rank1]:[W1204 13:28:16.996700352 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8517824Z [rank3]:[W1204 13:28:16.390706295 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8518000Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8518255Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8518419Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8518786Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8518989Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8519094Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8519190Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8519286Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8519290Z 2025-12-04T13:44:25.8519523Z [rank3]:[W1204 13:28:16.392484726 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8519693Z [rank2]:[W1204 13:28:16.415205074 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8519867Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8520136Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8520299Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8520705Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8520919Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8521024Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8521121Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8521217Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8521220Z 2025-12-04T13:44:25.8521453Z [rank2]:[W1204 13:28:16.416448086 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8521623Z [rank1]:[W1204 13:28:17.996824175 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8521799Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8522053Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8522215Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8522581Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8522782Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8522890Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8522985Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8523082Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8523084Z 2025-12-04T13:44:25.8523315Z [rank1]:[W1204 13:28:17.998799171 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8523486Z [rank3]:[W1204 13:28:17.392696486 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8523661Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8523925Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8524088Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8524462Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8524674Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8524788Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8524885Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8524980Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8524983Z 2025-12-04T13:44:25.8525216Z [rank3]:[W1204 13:28:17.394984675 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8525387Z [rank2]:[W1204 13:28:17.416546329 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8525561Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8525817Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8525980Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8526347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8526551Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8526655Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8526750Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8526847Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8526849Z 2025-12-04T13:44:25.8527083Z [rank2]:[W1204 13:28:17.418534946 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8527257Z [rank1]:[W1204 13:28:18.998977573 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8527433Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8527736Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8527918Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8528284Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8528509Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8528614Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8528720Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8528817Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8528820Z 2025-12-04T13:44:25.8529052Z [rank1]:[W1204 13:28:18.001578485 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8529223Z [rank3]:[W1204 13:28:18.395172297 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8529402Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8529659Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8529823Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8530187Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8530391Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8530494Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8530591Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8530685Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8530688Z 2025-12-04T13:44:25.8530921Z [rank3]:[W1204 13:28:18.397552924 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8531092Z [rank2]:[W1204 13:28:18.418679858 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8531266Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8531522Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8531688Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8532065Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8532276Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8532391Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8532489Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8532595Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8532597Z 2025-12-04T13:44:25.8532831Z [rank2]:[W1204 13:28:18.420803531 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8533000Z [rank1]:[W1204 13:28:19.001710078 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8533175Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8533429Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8533593Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8533965Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8534166Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8534270Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8534365Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8534463Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8534466Z 2025-12-04T13:44:25.8534699Z [rank1]:[W1204 13:28:19.003759953 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8534868Z [rank3]:[W1204 13:28:19.397692317 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8535044Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8535300Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8535463Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8535841Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8536045Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8536164Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8536272Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8536369Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8536382Z 2025-12-04T13:44:25.8536615Z [rank3]:[W1204 13:28:19.398952099 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8536787Z [rank2]:[W1204 13:28:19.421132900 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8536961Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8537220Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8537381Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8537789Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8537993Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8538099Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8538195Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8538293Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8538294Z 2025-12-04T13:44:25.8538527Z [rank2]:[W1204 13:28:19.423210874 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8538697Z [rank1]:[W1204 13:28:20.003917026 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8538874Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8539128Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8539291Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8539670Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8539872Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8539991Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8540086Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8540195Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8540197Z 2025-12-04T13:44:25.8540430Z [rank1]:[W1204 13:28:20.005181848 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8540614Z [rank3]:[W1204 13:28:20.399101343 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8540791Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8541046Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8541210Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8541575Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8541779Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8541882Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8541979Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8542077Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8542078Z 2025-12-04T13:44:25.8542312Z [rank3]:[W1204 13:28:20.400335305 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8542483Z [rank2]:[W1204 13:28:20.423350567 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8542660Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8542916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8543079Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8543449Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8543664Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8543768Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8543865Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8543973Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8543975Z 2025-12-04T13:44:25.8544219Z [rank2]:[W1204 13:28:20.425557009 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8544400Z [rank1]:[W1204 13:28:21.005327941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8544576Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8544833Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8544999Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8545365Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8545567Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8545671Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8545765Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8545863Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8545865Z 2025-12-04T13:44:25.8546098Z [rank1]:[W1204 13:28:21.006614503 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8546269Z [rank3]:[W1204 13:28:21.400623136 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8546445Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8546701Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8546866Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8547234Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8547435Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8547582Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8547679Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8547775Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8547790Z 2025-12-04T13:44:25.8548035Z [rank3]:[W1204 13:28:21.401994265 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8548207Z [rank2]:[W1204 13:28:21.425713462 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8548395Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8548654Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8548817Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8549193Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8549398Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8549502Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8549598Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8549694Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8549697Z 2025-12-04T13:44:25.8549930Z [rank2]:[W1204 13:28:21.427441874 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8550100Z [rank1]:[W1204 13:28:22.006750987 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8550277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8550532Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8550694Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8551062Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8551266Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8551372Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8551483Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8551580Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8551582Z 2025-12-04T13:44:25.8551815Z [rank1]:[W1204 13:28:22.009283961 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8552004Z [rank3]:[W1204 13:28:22.402166629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8552182Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8552451Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8552614Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8552980Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8553183Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8553288Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8553383Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8553481Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8553483Z 2025-12-04T13:44:25.8553717Z [rank3]:[W1204 13:28:22.403407072 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8553890Z [rank2]:[W1204 13:28:22.427622007 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8554064Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8554319Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8554482Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8554849Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8555052Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8555157Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8555253Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8555358Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8555360Z 2025-12-04T13:44:25.8555597Z [rank2]:[W1204 13:28:22.429652362 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8555778Z [rank1]:[W1204 13:28:23.009376496 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8555962Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8556219Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8556393Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8556759Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8556962Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8557067Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8557162Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8557258Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8557261Z 2025-12-04T13:44:25.8557542Z [rank1]:[W1204 13:28:23.011637966 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8557713Z [rank3]:[W1204 13:28:23.403751942 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8557892Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8558148Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8558313Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8558677Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8558881Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8558987Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8559081Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8559180Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8559182Z 2025-12-04T13:44:25.8559431Z [rank3]:[W1204 13:28:23.405009804 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8559602Z [rank2]:[W1204 13:28:23.429812346 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8559789Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8560059Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8560236Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8560606Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8560808Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8560912Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8561008Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8561105Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8561106Z 2025-12-04T13:44:25.8561342Z [rank2]:[W1204 13:28:23.431694095 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8561511Z [rank1]:[W1204 13:28:24.011756292 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8561687Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8561942Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8562106Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8562475Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8562675Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8562780Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8562876Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8562973Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8562976Z 2025-12-04T13:44:25.8563219Z [rank1]:[W1204 13:28:24.013899434 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8563389Z [rank3]:[W1204 13:28:24.405154269 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8563563Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8563838Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8564001Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8564387Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8564590Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8564696Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8564792Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8564891Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8564894Z 2025-12-04T13:44:25.8565126Z [rank3]:[W1204 13:28:24.406393141 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8565297Z [rank2]:[W1204 13:28:24.431879659 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8565469Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8565727Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8565890Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8566258Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8566462Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8566569Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8566666Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8566763Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8566765Z 2025-12-04T13:44:25.8566999Z [rank2]:[W1204 13:28:24.434003012 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8567180Z [rank1]:[W1204 13:28:25.014080439 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8567356Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8567671Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8567845Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8568212Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8568427Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8568531Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8568626Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8568724Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8568726Z 2025-12-04T13:44:25.8568963Z [rank1]:[W1204 13:28:25.016384668 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8569134Z [rank3]:[W1204 13:28:25.406539886 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8569308Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8569563Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8569729Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8570094Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8570297Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8570401Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8570498Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8570595Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8570597Z 2025-12-04T13:44:25.8570829Z [rank3]:[W1204 13:28:25.407801949 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8571002Z [rank2]:[W1204 13:28:25.434169987 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8571191Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8571450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8571624Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8572001Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8572215Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8572318Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8572415Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8572511Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8572513Z 2025-12-04T13:44:25.8572748Z [rank2]:[W1204 13:28:25.435602735 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8572918Z [rank1]:[W1204 13:28:26.016542113 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8573096Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8573353Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8573517Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8573884Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8574086Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8574191Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8574286Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8574384Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8574386Z 2025-12-04T13:44:25.8574620Z [rank1]:[W1204 13:28:26.018160167 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8574789Z [rank3]:[W1204 13:28:26.407967044 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8574964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8575228Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8575403Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8575780Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8575992Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8576099Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8576194Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8576290Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8576293Z 2025-12-04T13:44:25.8576525Z [rank3]:[W1204 13:28:26.409367593 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8576695Z [rank2]:[W1204 13:28:26.435795479 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8576871Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8577128Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8577292Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8577706Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8577911Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8578016Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8578112Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8578208Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8578211Z 2025-12-04T13:44:25.8578445Z [rank2]:[W1204 13:28:26.437718907 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8578617Z [rank1]:[W1204 13:28:27.018288313 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8578792Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8579059Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8579221Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8579624Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8579826Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8579945Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8580041Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8580137Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8580139Z 2025-12-04T13:44:25.8580371Z [rank1]:[W1204 13:28:27.020888666 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8580543Z [rank3]:[W1204 13:28:27.409513049 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8580718Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8580974Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8581137Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8581501Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8581707Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8581813Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8581910Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8582010Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8582012Z 2025-12-04T13:44:25.8582247Z [rank3]:[W1204 13:28:27.410757421 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8582418Z [rank2]:[W1204 13:28:27.437842963 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8582593Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8582849Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8583024Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8583390Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8583613Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8583716Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8583823Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8583920Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8583922Z 2025-12-04T13:44:25.8584157Z [rank2]:[W1204 13:28:27.440124843 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8584330Z [rank1]:[W1204 13:28:28.021052592 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8584505Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8584761Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8584926Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8585293Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8585496Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8585601Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8585698Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8585797Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8585800Z 2025-12-04T13:44:25.8586033Z [rank1]:[W1204 13:28:28.023676154 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8586205Z [rank3]:[W1204 13:28:28.410927547 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8586384Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8586641Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8586806Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8587180Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8587392Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8587552Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8587647Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8587758Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8587759Z 2025-12-04T13:44:25.8587993Z [rank3]:[W1204 13:28:28.412349956 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8588166Z [rank2]:[W1204 13:28:28.440266370 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8588343Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8588602Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8588768Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8589135Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8589339Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8589444Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8589541Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8589637Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8589640Z 2025-12-04T13:44:25.8589876Z [rank2]:[W1204 13:28:28.442082030 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8590047Z [rank1]:[W1204 13:28:29.023825930 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8590221Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8590478Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8590640Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8591022Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8591223Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8591342Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8591448Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8591545Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8591557Z 2025-12-04T13:44:25.8591791Z [rank1]:[W1204 13:28:29.025920924 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8591960Z [rank3]:[W1204 13:28:29.412506752 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8592136Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8592392Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8592555Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8592923Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8593127Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8593233Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8593329Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8593425Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8593428Z 2025-12-04T13:44:25.8593659Z [rank3]:[W1204 13:28:29.413748495 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8593832Z [rank2]:[W1204 13:28:29.442246186 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8594006Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8594263Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8594427Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8594803Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8595008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8595129Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8595225Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8595335Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8595337Z 2025-12-04T13:44:25.8595571Z [rank2]:[W1204 13:28:29.444169863 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8595751Z [rank1]:[W1204 13:28:30.026080991 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8595925Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8596182Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8596344Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8596710Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8596913Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8597019Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8597116Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8597212Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8597214Z 2025-12-04T13:44:25.8601134Z [rank1]:[W1204 13:28:30.028374740 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8601313Z [rank3]:[W1204 13:28:30.413901142 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8601491Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8601749Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8601915Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8602281Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8602532Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8602638Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8602747Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8602844Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8602846Z 2025-12-04T13:44:25.8603093Z [rank3]:[W1204 13:28:30.415287351 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8603281Z [rank2]:[W1204 13:28:30.444323100 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8603459Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8603715Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8603879Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8604246Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8604452Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8604555Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8604653Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8604751Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8604753Z 2025-12-04T13:44:25.8604987Z [rank2]:[W1204 13:28:30.445569623 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8605157Z [rank1]:[W1204 13:28:31.028482788 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8605332Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8605590Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8605753Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8606121Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8606334Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8606439Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8606535Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8606640Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8606642Z 2025-12-04T13:44:25.8606885Z [rank1]:[W1204 13:28:31.030253949 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8607056Z [rank3]:[W1204 13:28:31.415450478 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8607243Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8607549Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8607715Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8608086Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8608288Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8608392Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8608486Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8608582Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8608585Z 2025-12-04T13:44:25.8608817Z [rank3]:[W1204 13:28:31.416910586 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8608986Z [rank2]:[W1204 13:28:31.445724940 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8609161Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8609416Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8609580Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8609953Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8610157Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8610261Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8610377Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8610474Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8610489Z 2025-12-04T13:44:25.8610722Z [rank2]:[W1204 13:28:31.446978063 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8610903Z [rank1]:[W1204 13:28:32.030358608 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8611078Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8611347Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8611508Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8611875Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8612078Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8612184Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8612280Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8612375Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8612376Z 2025-12-04T13:44:25.8612609Z [rank1]:[W1204 13:28:32.032802684 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8612779Z [rank3]:[W1204 13:28:32.417057224 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8612953Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8613208Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8613370Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8613734Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8613936Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8614041Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8614137Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8614251Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8614253Z 2025-12-04T13:44:25.8614486Z [rank3]:[W1204 13:28:32.418292077 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8614667Z [rank2]:[W1204 13:28:32.447155940 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8614849Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8615117Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8615281Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8615647Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8615852Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8615956Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8616054Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8616151Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8616153Z 2025-12-04T13:44:25.8616391Z [rank2]:[W1204 13:28:32.449153705 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8616564Z [rank1]:[W1204 13:28:33.032918492 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8616738Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8616993Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8617156Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8617555Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8617758Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8617862Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8617959Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8618056Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8618058Z 2025-12-04T13:44:25.8618310Z [rank1]:[W1204 13:28:33.035033426 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8618481Z [rank3]:[W1204 13:28:33.418458944 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8618681Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8618938Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8619114Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8619484Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8619686Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8619792Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8619887Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8619985Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8619987Z 2025-12-04T13:44:25.8620221Z [rank3]:[W1204 13:28:33.420146027 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8620392Z [rank2]:[W1204 13:28:33.449658556 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8620567Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8620826Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8620991Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8621356Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8621558Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8621662Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8621757Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8621854Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8621855Z 2025-12-04T13:44:25.8622097Z [rank2]:[W1204 13:28:33.452107612 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8622267Z [rank1]:[W1204 13:28:34.035134275 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8622453Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8622717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8622878Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8623257Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8623457Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8623562Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8623659Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8623755Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8623757Z 2025-12-04T13:44:25.8623991Z [rank1]:[W1204 13:28:34.036974155 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8624161Z [rank3]:[W1204 13:28:34.420291106 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8624335Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8624592Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8624755Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8625129Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8625333Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8625439Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8625533Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8625629Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8625632Z 2025-12-04T13:44:25.8625863Z [rank3]:[W1204 13:28:34.422307071 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8626048Z [rank2]:[W1204 13:28:34.452268950 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8626222Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8626490Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8626664Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8627045Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8627247Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8627351Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8627447Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8627581Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8627584Z 2025-12-04T13:44:25.8627816Z [rank2]:[W1204 13:28:34.453888464 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8627988Z [rank1]:[W1204 13:28:35.037114004 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8628162Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8628417Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8628580Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8628944Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8629145Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8629247Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8629344Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8629440Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8629442Z 2025-12-04T13:44:25.8629680Z [rank1]:[W1204 13:28:35.038366216 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8629864Z [rank3]:[W1204 13:28:35.422431271 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8630039Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8630293Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8630480Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8630845Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8631062Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8631166Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8631261Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8631356Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8631359Z 2025-12-04T13:44:25.8631594Z [rank3]:[W1204 13:28:35.423716232 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8631768Z [rank2]:[W1204 13:28:35.454074092 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8631944Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8632198Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8632365Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8632730Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8632933Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8633037Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8633133Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8633230Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8633233Z 2025-12-04T13:44:25.8633466Z [rank2]:[W1204 13:28:35.456251594 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8633636Z [rank1]:[W1204 13:28:36.038478856 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8633822Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8634078Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8634250Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8634625Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8634835Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8634939Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8635035Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8635131Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8635133Z 2025-12-04T13:44:25.8635366Z [rank1]:[W1204 13:28:36.040617809 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8635535Z [rank3]:[W1204 13:28:36.423853632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8635711Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8635968Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8636133Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8636502Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8636703Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8636808Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8636902Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8636998Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8637001Z 2025-12-04T13:44:25.8637233Z [rank3]:[W1204 13:28:36.425849278 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8637403Z [rank2]:[W1204 13:28:36.456398864 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8637611Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8637878Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8638042Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8638446Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8638647Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8638763Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8638860Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8638956Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8638959Z 2025-12-04T13:44:25.8639193Z [rank2]:[W1204 13:28:36.458689323 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8639364Z [rank1]:[W1204 13:28:37.040734399 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8639537Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8639794Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8639956Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8640324Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8640528Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8640632Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8640728Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8640823Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8640825Z 2025-12-04T13:44:25.8641058Z [rank1]:[W1204 13:28:37.042964530 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8641229Z [rank3]:[W1204 13:28:37.425972718 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8641404Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8641669Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8641832Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8642200Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8642427Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8642545Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8642639Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8642739Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8642741Z 2025-12-04T13:44:25.8642974Z [rank3]:[W1204 13:28:37.427691050 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8643145Z [rank2]:[W1204 13:28:37.458841323 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8643320Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8643574Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8643738Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8644103Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8644307Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8644410Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8644507Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8644604Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8644606Z 2025-12-04T13:44:25.8644839Z [rank2]:[W1204 13:28:37.460154064 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8645011Z [rank1]:[W1204 13:28:38.043100580 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8645185Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8645440Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8645613Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8645978Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8646201Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8646305Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8646415Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8646510Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8646511Z 2025-12-04T13:44:25.8646745Z [rank1]:[W1204 13:28:38.044800083 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8646915Z [rank3]:[W1204 13:28:38.427865250 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8647091Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8647346Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8647541Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8647908Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8648109Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8648214Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8648308Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8648405Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8648407Z 2025-12-04T13:44:25.8648640Z [rank3]:[W1204 13:28:38.429905045 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8648812Z [rank2]:[W1204 13:28:38.460338663 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8648987Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8649244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8649409Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8649787Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8650002Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8650118Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8650215Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8650326Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8650328Z 2025-12-04T13:44:25.8650562Z [rank2]:[W1204 13:28:38.462828679 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8650732Z [rank1]:[W1204 13:28:39.045048631 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8650906Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8651163Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8651325Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8651698Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8651898Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8652002Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8652098Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8652193Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8652196Z 2025-12-04T13:44:25.8652430Z [rank1]:[W1204 13:28:39.047155055 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8652599Z [rank3]:[W1204 13:28:39.430061125 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8652774Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8653030Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8653192Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8653572Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8653772Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8653886Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8653989Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8654087Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8654088Z 2025-12-04T13:44:25.8654330Z [rank3]:[W1204 13:28:39.431457855 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8654501Z [rank2]:[W1204 13:28:39.462898421 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8654676Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8654930Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8655095Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8655463Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8655666Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8655771Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8655867Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8655964Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8655966Z 2025-12-04T13:44:25.8656199Z [rank2]:[W1204 13:28:39.465247289 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8656372Z [rank1]:[W1204 13:28:40.047317635 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8656545Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8656801Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8656965Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8657331Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8657582Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8657686Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8657801Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8657897Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8657910Z 2025-12-04T13:44:25.8658150Z [rank1]:[W1204 13:28:40.049386479 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8658333Z [rank3]:[W1204 13:28:40.431598386 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8658508Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8658765Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8658929Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8659296Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8659498Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8659602Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8659697Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8659794Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8659796Z 2025-12-04T13:44:25.8660030Z [rank3]:[W1204 13:28:40.433286589 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8660203Z [rank2]:[W1204 13:28:40.465377640 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8660380Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8660634Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8660798Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8661164Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8661375Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8661480Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8661575Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8661682Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8661683Z 2025-12-04T13:44:25.8661927Z [rank2]:[W1204 13:28:40.467501074 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8662098Z [rank1]:[W1204 13:28:41.049529771 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8662286Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8662544Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8662706Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8663076Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8663279Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8663382Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8663478Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8663573Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8663576Z 2025-12-04T13:44:25.8663809Z [rank1]:[W1204 13:28:41.051010018 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8663979Z [rank3]:[W1204 13:28:41.433422510 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8664154Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8664413Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8664576Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8664945Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8665148Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8665264Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8665359Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8665456Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8665468Z 2025-12-04T13:44:25.8665702Z [rank3]:[W1204 13:28:41.434652833 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8665882Z [rank2]:[W1204 13:28:41.467669644 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8666067Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8666325Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8666489Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8666865Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8667069Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8667174Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8667270Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8667367Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8667369Z 2025-12-04T13:44:25.8667637Z [rank2]:[W1204 13:28:41.469150012 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8667808Z [rank1]:[W1204 13:28:42.051154890 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8667982Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8668238Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8668399Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8668769Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8668973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8669077Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8669187Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8669282Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8669284Z 2025-12-04T13:44:25.8669519Z [rank1]:[W1204 13:28:42.053483608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8669715Z [rank3]:[W1204 13:28:42.434785825 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8669891Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8670162Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8670323Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8670690Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8670892Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8670998Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8671094Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8671192Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8671194Z 2025-12-04T13:44:25.8671426Z [rank3]:[W1204 13:28:42.436495857 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8671596Z [rank2]:[W1204 13:28:42.469292334 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8671771Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8672025Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8672191Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8672556Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8672759Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8672866Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8672962Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8673077Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8673079Z 2025-12-04T13:44:25.8673315Z [rank2]:[W1204 13:28:42.470720082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8673497Z [rank1]:[W1204 13:28:43.053632270 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8673681Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8673938Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8674111Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8674478Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8674681Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8674784Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8674881Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8674977Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8674979Z 2025-12-04T13:44:25.8675213Z [rank1]:[W1204 13:28:43.055769283 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8675382Z [rank3]:[W1204 13:28:43.436613550 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8675558Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8675813Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8675974Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8676341Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8676541Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8676647Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8676743Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8676840Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8676842Z 2025-12-04T13:44:25.8677086Z [rank3]:[W1204 13:28:43.438530338 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8677256Z [rank2]:[W1204 13:28:43.470837585 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8677446Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8677741Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8677917Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8678284Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8678487Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8678592Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8678687Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8678784Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8678787Z 2025-12-04T13:44:25.8679018Z [rank2]:[W1204 13:28:43.472946838 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8679188Z [rank1]:[W1204 13:28:44.055913846 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8679362Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8679619Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8679781Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8680152Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8680353Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8680457Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8680553Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8680649Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8680652Z 2025-12-04T13:44:25.8680897Z [rank1]:[W1204 13:28:44.057968790 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8681066Z [rank3]:[W1204 13:28:44.438674990 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8681242Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8681522Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8681683Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8682065Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8682265Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8682370Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8682465Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8682562Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8682564Z 2025-12-04T13:44:25.8682797Z [rank3]:[W1204 13:28:44.440245976 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8682966Z [rank2]:[W1204 13:28:44.473089951 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8683140Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8683395Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8683558Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8683924Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8684132Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8684239Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8684335Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8684432Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8684434Z 2025-12-04T13:44:25.8684666Z [rank2]:[W1204 13:28:44.475052738 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8684849Z [rank1]:[W1204 13:28:45.058131433 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8685023Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8685279Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8685462Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8685828Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8686046Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8686150Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8686247Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8686342Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8686345Z 2025-12-04T13:44:25.8686580Z [rank1]:[W1204 13:28:45.060283305 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8686750Z [rank3]:[W1204 13:28:45.440418668 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8686926Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8687182Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8687344Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8687729Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8687931Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8688036Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8688132Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8688228Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8688230Z 2025-12-04T13:44:25.8688468Z [rank3]:[W1204 13:28:45.441650361 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8688638Z [rank2]:[W1204 13:28:45.475199641 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8688829Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8689084Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8689260Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8689635Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8689851Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8689957Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8690053Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8690152Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8690154Z 2025-12-04T13:44:25.8690386Z [rank2]:[W1204 13:28:45.476450963 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8690556Z [rank1]:[W1204 13:28:46.060630764 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8690734Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8690990Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8691154Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8691524Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8691726Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8691834Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8691931Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8692026Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8692029Z 2025-12-04T13:44:25.8692263Z [rank1]:[W1204 13:28:46.061908636 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8692434Z [rank3]:[W1204 13:28:46.441811134 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8692609Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8692880Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8693042Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8693430Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8693645Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8693751Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8693845Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8693941Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8693944Z 2025-12-04T13:44:25.8694177Z [rank3]:[W1204 13:28:46.443825449 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8694346Z [rank2]:[W1204 13:28:46.476591457 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8694523Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8694777Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8694941Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8695314Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8695516Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8695621Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8695716Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8695814Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8695815Z 2025-12-04T13:44:25.8696047Z [rank2]:[W1204 13:28:46.478320658 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8696219Z [rank1]:[W1204 13:28:47.062036600 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8696392Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8696660Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8696821Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8697201Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8697417Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8697571Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8697668Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8697763Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8697765Z 2025-12-04T13:44:25.8698000Z [rank1]:[W1204 13:28:47.063309232 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8698175Z [rank3]:[W1204 13:28:47.443966843 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8698349Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8698606Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8698769Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8699135Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8699339Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8699445Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8699542Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8699638Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8699640Z 2025-12-04T13:44:25.8699873Z [rank3]:[W1204 13:28:47.445981709 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8700045Z [rank2]:[W1204 13:28:47.478599859 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8700220Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8700475Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8700654Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8701021Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8701259Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8701364Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8701472Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8701568Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8701571Z 2025-12-04T13:44:25.8701805Z [rank2]:[W1204 13:28:47.481298780 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8701978Z [rank1]:[W1204 13:28:48.063487645 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8702153Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8702408Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8702572Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8702936Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8703139Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8703242Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8703339Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8703435Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8703437Z 2025-12-04T13:44:25.8703673Z [rank1]:[W1204 13:28:48.065180437 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8703846Z [rank3]:[W1204 13:28:48.446123453 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8704020Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8704277Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8704438Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8704815Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8705025Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8705141Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8705236Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8705346Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8705347Z 2025-12-04T13:44:25.8705581Z [rank3]:[W1204 13:28:48.448391133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8705750Z [rank2]:[W1204 13:28:48.481442064 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8705927Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8706184Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8706348Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8706716Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8706917Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8707021Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8707117Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8707213Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8707216Z 2025-12-04T13:44:25.8707448Z [rank2]:[W1204 13:28:48.483556777 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8707659Z [rank1]:[W1204 13:28:49.065351461 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8707833Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8708091Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8708256Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8708636Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8708838Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8708954Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8709062Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8709158Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8709173Z 2025-12-04T13:44:25.8709407Z [rank1]:[W1204 13:28:49.066942066 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8709578Z [rank3]:[W1204 13:28:49.448535877 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8709753Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8710013Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8710175Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8710549Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8710750Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8710855Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8710951Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8711047Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8711049Z 2025-12-04T13:44:25.8711281Z [rank3]:[W1204 13:28:49.450357167 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8711452Z [rank2]:[W1204 13:28:49.483683342 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8711627Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8711883Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8712048Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8712430Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8712634Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8712749Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8712844Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8712952Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8712954Z 2025-12-04T13:44:25.8713187Z [rank2]:[W1204 13:28:49.485710507 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8713372Z [rank1]:[W1204 13:28:50.067081971 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8713548Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8713802Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8713967Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8714332Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8714534Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8714638Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8714735Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8714831Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8714833Z 2025-12-04T13:44:25.8715067Z [rank1]:[W1204 13:28:50.068853892 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8715238Z [rank3]:[W1204 13:28:50.450514902 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8715412Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8715668Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8715830Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8716197Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8716408Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8716514Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8716610Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8716721Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8716723Z 2025-12-04T13:44:25.8716969Z [rank3]:[W1204 13:28:50.452473359 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8717149Z [rank2]:[W1204 13:28:50.485781634 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8717325Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8717618Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8717784Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8718152Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8718356Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8718461Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8718555Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8718653Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8718655Z 2025-12-04T13:44:25.8718888Z [rank2]:[W1204 13:28:50.488057343 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8719060Z [rank1]:[W1204 13:28:51.069031726 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8719236Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8719497Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8719659Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8720027Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8720230Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8720347Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8720442Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8720537Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8720551Z 2025-12-04T13:44:25.8720799Z [rank1]:[W1204 13:28:51.070833936 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8720970Z [rank3]:[W1204 13:28:51.452623744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8721158Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8721420Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8721582Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8721949Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8722151Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8722255Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8722350Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8722446Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8722449Z 2025-12-04T13:44:25.8722683Z [rank3]:[W1204 13:28:51.454583370 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8722854Z [rank2]:[W1204 13:28:51.488197659 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8723030Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8723285Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8723448Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8723819Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8724021Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8724126Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8724229Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8724327Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8724328Z 2025-12-04T13:44:25.8724560Z [rank2]:[W1204 13:28:51.490522248 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8724755Z [rank1]:[W1204 13:28:52.070966632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8724930Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8725195Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8725358Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8725731Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8725935Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8726040Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8726137Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8726233Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8726237Z 2025-12-04T13:44:25.8726469Z [rank1]:[W1204 13:28:52.072210495 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8726642Z [rank3]:[W1204 13:28:52.454717376 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8726818Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8727077Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8727241Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8727647Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8727852Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8727957Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8728053Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8728163Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8728166Z 2025-12-04T13:44:25.8728400Z [rank3]:[W1204 13:28:52.456593845 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8728585Z [rank2]:[W1204 13:28:52.490654933 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8728773Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8729028Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8729208Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8729575Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8729777Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8729882Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8729977Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8730076Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8730079Z 2025-12-04T13:44:25.8730312Z [rank2]:[W1204 13:28:52.492716948 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8730485Z [rank1]:[W1204 13:28:53.072348291 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8730662Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8730918Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8731083Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8731450Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8731652Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8731757Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8731852Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8731948Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8731951Z 2025-12-04T13:44:25.8732202Z [rank1]:[W1204 13:28:53.073596553 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8732377Z [rank3]:[W1204 13:28:53.456735241 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8732562Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8732827Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8733000Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8733367Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8733569Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8733673Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8733769Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8733865Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8733867Z 2025-12-04T13:44:25.8734102Z [rank3]:[W1204 13:28:53.458548841 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8734273Z [rank2]:[W1204 13:28:53.492887093 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8734452Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8734710Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8734876Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8735246Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8735448Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8735552Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8735648Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8735744Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8735747Z 2025-12-04T13:44:25.8735990Z [rank2]:[W1204 13:28:53.495184603 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8736163Z [rank1]:[W1204 13:28:54.073751269 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8736348Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8736615Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8736780Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8737157Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8737360Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8737464Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8737589Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8737684Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8737688Z 2025-12-04T13:44:25.8737920Z [rank1]:[W1204 13:28:54.075469431 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8738091Z [rank3]:[W1204 13:28:54.458649138 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8738265Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8738524Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8738687Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8739061Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8739262Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8739367Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8739463Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8739561Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8739563Z 2025-12-04T13:44:25.8739796Z [rank3]:[W1204 13:28:54.460693013 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8739982Z [rank2]:[W1204 13:28:54.495320199 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8740158Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8740427Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8740604Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8740975Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8741191Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8741296Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8741392Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8741490Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8741492Z 2025-12-04T13:44:25.8741722Z [rank2]:[W1204 13:28:54.497386704 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8741895Z [rank1]:[W1204 13:28:55.075639407 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8742069Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8742323Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8742487Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8742854Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8743058Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8743162Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8743259Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8743356Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8743358Z 2025-12-04T13:44:25.8743593Z [rank1]:[W1204 13:28:55.078052454 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8743766Z [rank3]:[W1204 13:28:55.460825420 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8743951Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8744208Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8744381Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8744757Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8744975Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8745079Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8745174Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8745270Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8745274Z 2025-12-04T13:44:25.8745511Z [rank3]:[W1204 13:28:55.462686099 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8745681Z [rank2]:[W1204 13:28:55.497551750 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8745856Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8746109Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8746274Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8746642Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8746845Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8746949Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8747043Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8747140Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8747142Z 2025-12-04T13:44:25.8747374Z [rank2]:[W1204 13:28:55.499723082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8747584Z [rank1]:[W1204 13:28:56.078208251 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8747761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8748027Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8748204Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8748581Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8748795Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8748899Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8748995Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8749091Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8749094Z 2025-12-04T13:44:25.8749326Z [rank1]:[W1204 13:28:56.080692796 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8749496Z [rank3]:[W1204 13:28:56.462859076 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8749671Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8749932Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8750094Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8750464Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8750666Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8750771Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8750866Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8750961Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8750964Z 2025-12-04T13:44:25.8751196Z [rank3]:[W1204 13:28:56.464876711 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8751368Z [rank2]:[W1204 13:28:56.499894629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8751544Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8751810Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8751974Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8752364Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8752566Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8752682Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8752778Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8752876Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8752877Z 2025-12-04T13:44:25.8753109Z [rank2]:[W1204 13:28:56.502029552 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8753280Z [rank1]:[W1204 13:28:57.080861053 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8753455Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8753711Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8753874Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8754243Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8754447Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8754552Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8754647Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8754744Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8754746Z 2025-12-04T13:44:25.8754978Z [rank1]:[W1204 13:28:57.083036165 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8755149Z [rank3]:[W1204 13:28:57.465029738 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8755324Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8755579Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8755750Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8756116Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8756337Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8756442Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8756549Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8756645Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8756647Z 2025-12-04T13:44:25.8756882Z [rank3]:[W1204 13:28:57.466995475 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8757052Z [rank2]:[W1204 13:28:57.502184879 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8757230Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8757520Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8757687Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8758053Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8758256Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8758362Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8758458Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8758555Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8758557Z 2025-12-04T13:44:25.8758790Z [rank2]:[W1204 13:28:57.504165475 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8758960Z [rank1]:[W1204 13:28:58.083213992 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8759135Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8759388Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8759550Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8759927Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8760145Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8760267Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8760363Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8760472Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8760474Z 2025-12-04T13:44:25.8760709Z [rank1]:[W1204 13:28:58.085273406 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8760878Z [rank3]:[W1204 13:28:58.467235971 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8761054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8761313Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8761476Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8761847Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8762050Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8762154Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8762250Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8762347Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8762348Z 2025-12-04T13:44:25.8762583Z [rank3]:[W1204 13:28:58.469425702 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8762754Z [rank2]:[W1204 13:28:58.504314233 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8762931Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8763187Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8763351Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8763730Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8763931Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8764058Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8764170Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8764268Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8764280Z 2025-12-04T13:44:25.8764514Z [rank2]:[W1204 13:28:58.505548006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8764684Z [rank1]:[W1204 13:28:59.085459834 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8764857Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8765116Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8765280Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8765645Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8765845Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8765952Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8766047Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8766143Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8766146Z 2025-12-04T13:44:25.8766378Z [rank1]:[W1204 13:28:59.087850101 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8766549Z [rank3]:[W1204 13:28:59.469561961 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8766723Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8766980Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8767143Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8767562Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8767763Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8767879Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8767975Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8768083Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8768086Z 2025-12-04T13:44:25.8768318Z [rank3]:[W1204 13:28:59.471609336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8768504Z [rank2]:[W1204 13:28:59.505691744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8768680Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8768936Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8769098Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8769470Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8769672Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8769776Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8769873Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8769970Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8769972Z 2025-12-04T13:44:25.8770205Z [rank2]:[W1204 13:28:59.506920207 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8770376Z [rank1]:[W1204 13:29:00.088107787 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8770551Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8770804Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8770968Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8771334Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8771545Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8771650Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8771755Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8771852Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8771854Z 2025-12-04T13:44:25.8772094Z [rank1]:[W1204 13:29:00.090732599 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8772276Z [rank3]:[W1204 13:29:00.471792753 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8772451Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8772707Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8772871Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8773242Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8773445Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8773548Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8773645Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8773742Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8773744Z 2025-12-04T13:44:25.8773978Z [rank3]:[W1204 13:29:00.473483686 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8774147Z [rank2]:[W1204 13:29:00.507053446 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8774322Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8774577Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8774740Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8775106Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8775319Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8775425Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8775520Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8775635Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8775638Z 2025-12-04T13:44:25.8775885Z [rank2]:[W1204 13:29:00.508280898 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8776055Z [rank1]:[W1204 13:29:01.090865938 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8776240Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8776494Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8776657Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8777021Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8777223Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8777328Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8777422Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8777559Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8777562Z 2025-12-04T13:44:25.8777796Z [rank1]:[W1204 13:29:01.093260295 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8777967Z [rank3]:[W1204 13:29:01.473616085 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8778143Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8778401Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8778564Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8778932Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8779133Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8779237Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8779346Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8779442Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8779456Z 2025-12-04T13:44:25.8779689Z [rank3]:[W1204 13:29:01.475058374 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8779870Z [rank2]:[W1204 13:29:01.508416148 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8780046Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8780316Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8780479Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8780850Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8781051Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8781161Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8781257Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8781377Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8781379Z 2025-12-04T13:44:25.8781612Z [rank2]:[W1204 13:29:01.509811057 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8781783Z [rank1]:[W1204 13:29:02.093443903 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8781961Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8782217Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8782380Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8782754Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8782959Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8783065Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8783160Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8783267Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8783269Z 2025-12-04T13:44:25.8783501Z [rank1]:[W1204 13:29:02.095917199 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8783682Z [rank3]:[W1204 13:29:02.475180633 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8783867Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8784131Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8784295Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8784664Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8784872Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8784975Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8785072Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8785167Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8785169Z 2025-12-04T13:44:25.8785403Z [rank3]:[W1204 13:29:02.477119361 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8785573Z [rank2]:[W1204 13:29:02.509944116 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8785748Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8786003Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8786167Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8786535Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8786740Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8786845Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8786942Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8787040Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8787042Z 2025-12-04T13:44:25.8787285Z [rank2]:[W1204 13:29:02.511187159 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8787455Z [rank1]:[W1204 13:29:03.096091468 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8787692Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8787948Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8788125Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8788489Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8788692Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8788799Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8788894Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8788993Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8788995Z 2025-12-04T13:44:25.8789233Z [rank1]:[W1204 13:29:03.098531724 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8789406Z [rank3]:[W1204 13:29:03.477288900 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8789583Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8789840Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8790003Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8790370Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8790573Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8790679Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8790777Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8790873Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8790877Z 2025-12-04T13:44:25.8791129Z [rank3]:[W1204 13:29:03.479341764 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8791301Z [rank2]:[W1204 13:29:03.511312269 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8791488Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8791754Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8791916Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8792296Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8792497Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8792604Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8792699Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8792797Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8792800Z 2025-12-04T13:44:25.8793033Z [rank2]:[W1204 13:29:03.512522803 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8793203Z [rank1]:[W1204 13:29:04.098697683 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8793377Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8793636Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8793799Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8794168Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8794370Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8794476Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8794571Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8794667Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8794670Z 2025-12-04T13:44:25.8794902Z [rank1]:[W1204 13:29:04.101205438 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8795084Z [rank3]:[W1204 13:29:04.479497214 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8795258Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8795525Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8795697Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8796075Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8796277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8796383Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8796479Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8796575Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8796577Z 2025-12-04T13:44:25.8796810Z [rank3]:[W1204 13:29:04.481365843 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8796982Z [rank2]:[W1204 13:29:04.512658403 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8797156Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8797413Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8797619Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8797994Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8798198Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8798303Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8798401Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8798497Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8798500Z 2025-12-04T13:44:25.8798734Z [rank2]:[W1204 13:29:04.513883686 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8798918Z [rank1]:[W1204 13:29:05.101382118 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8799094Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8799347Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8799535Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8799902Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8800118Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8800224Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8800319Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8800415Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8800417Z 2025-12-04T13:44:25.8800653Z [rank1]:[W1204 13:29:05.103725356 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8800824Z [rank3]:[W1204 13:29:05.481466654 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8800999Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8801254Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8801417Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8801781Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8801984Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8802088Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8802184Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8802280Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8802282Z 2025-12-04T13:44:25.8802517Z [rank3]:[W1204 13:29:05.483450871 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8802688Z [rank2]:[W1204 13:29:05.514031296 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8802873Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8803130Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8803302Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8803680Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8803897Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8804002Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8804097Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8804195Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8804196Z 2025-12-04T13:44:25.8804435Z [rank2]:[W1204 13:29:05.515360467 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8804605Z [rank1]:[W1204 13:29:06.103910466 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8804781Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8805034Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8805198Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8805566Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8805767Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8805872Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8805967Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8806063Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8806068Z 2025-12-04T13:44:25.8806301Z [rank1]:[W1204 13:29:06.106247354 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8806475Z [rank3]:[W1204 13:29:06.483627241 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8806651Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8806919Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8807082Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8807467Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8807692Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8807811Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8807908Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8808003Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8808005Z 2025-12-04T13:44:25.8808240Z [rank3]:[W1204 13:29:06.484876703 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8808411Z [rank2]:[W1204 13:29:06.515451289 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8808585Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8808846Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8809010Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8809378Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8809579Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8809685Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8809783Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8809879Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8809881Z 2025-12-04T13:44:25.8810114Z [rank2]:[W1204 13:29:06.516832818 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8810285Z [rank1]:[W1204 13:29:07.106440964 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8810462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8810729Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8810893Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8811261Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8811489Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8811603Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8811698Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8811796Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8811798Z 2025-12-04T13:44:25.8812031Z [rank1]:[W1204 13:29:07.108827301 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8812203Z [rank3]:[W1204 13:29:07.485044314 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8812377Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8812635Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8812797Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8813167Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8813372Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8813475Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8813573Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8813669Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8813671Z 2025-12-04T13:44:25.8813903Z [rank3]:[W1204 13:29:07.486433603 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8814074Z [rank2]:[W1204 13:29:07.516986869 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8814250Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8814506Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8814680Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8815047Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8815274Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8815380Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8815487Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8815583Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8815584Z 2025-12-04T13:44:25.8815818Z [rank2]:[W1204 13:29:07.518678942 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8815988Z [rank1]:[W1204 13:29:08.108991512 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8816165Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8816419Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8816582Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8816949Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8817150Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8817255Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8817351Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8817449Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8817451Z 2025-12-04T13:44:25.8817720Z [rank1]:[W1204 13:29:08.110504019 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8817891Z [rank3]:[W1204 13:29:08.486609434 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8818066Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8818324Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8818486Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8818868Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8819086Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8819212Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8819308Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8819416Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8819419Z 2025-12-04T13:44:25.8819653Z [rank3]:[W1204 13:29:08.488072012 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8819827Z [rank2]:[W1204 13:29:08.518830783 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8820003Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8820260Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8820422Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8820793Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8820999Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8821103Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8821201Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8821297Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8821300Z 2025-12-04T13:44:25.8821534Z [rank2]:[W1204 13:29:08.520076336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8821704Z [rank1]:[W1204 13:29:09.110653741 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8821881Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8822140Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8822303Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8822680Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8822881Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8822999Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8823104Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8823200Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8823202Z 2025-12-04T13:44:25.8823446Z [rank1]:[W1204 13:29:09.111960882 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8823618Z [rank3]:[W1204 13:29:09.488223513 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8823794Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8824052Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8824218Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8824586Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8824787Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8824892Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8824987Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8825083Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8825087Z 2025-12-04T13:44:25.8825319Z [rank3]:[W1204 13:29:09.490049053 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8825493Z [rank2]:[W1204 13:29:09.520240117 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8825667Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8825923Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8826086Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8826457Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8826672Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8826775Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8826883Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8826980Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8826991Z 2025-12-04T13:44:25.8827227Z [rank2]:[W1204 13:29:09.522151755 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8827409Z [rank1]:[W1204 13:29:10.112072795 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8827603Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8827855Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8828021Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8828391Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8828598Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8828702Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8828797Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8828895Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8828896Z 2025-12-04T13:44:25.8829129Z [rank1]:[W1204 13:29:10.113308047 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8829301Z [rank3]:[W1204 13:29:10.490191656 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8829478Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8829733Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8829898Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8830264Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8830482Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8830587Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8830683Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8830792Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8830796Z 2025-12-04T13:44:25.8831043Z [rank3]:[W1204 13:29:10.491926347 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8831216Z [rank2]:[W1204 13:29:10.522488113 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8831404Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8831663Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8831828Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8832194Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8832398Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8832502Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8832599Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8832696Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8832699Z 2025-12-04T13:44:25.8832941Z [rank2]:[W1204 13:29:10.524101818 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8833111Z [rank1]:[W1204 13:29:11.113450310 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8833288Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8833543Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8833709Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8834077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8834278Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8834392Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8834486Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8834582Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8834599Z 2025-12-04T13:44:25.8834832Z [rank1]:[W1204 13:29:11.115463336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8835014Z [rank3]:[W1204 13:29:11.492064820 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8835199Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8835453Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8835616Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8835985Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8836190Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8836295Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8836394Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8836490Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8836493Z 2025-12-04T13:44:25.8836725Z [rank3]:[W1204 13:29:11.493982898 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8836897Z [rank2]:[W1204 13:29:11.524253360 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8837069Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8837330Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8837529Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8837898Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8838100Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8838205Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8838318Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8838414Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8838416Z 2025-12-04T13:44:25.8838650Z [rank2]:[W1204 13:29:11.525464393 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8838847Z [rank1]:[W1204 13:29:12.115598429 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8839024Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8839295Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8839461Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8839834Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8840038Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8840144Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8840239Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8840337Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8840339Z 2025-12-04T13:44:25.8840571Z [rank1]:[W1204 13:29:12.117952697 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8840744Z [rank3]:[W1204 13:29:12.494263598 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8840920Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8841174Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8841339Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8841709Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8841913Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8842017Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8842113Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8842221Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8842223Z 2025-12-04T13:44:25.8842454Z [rank3]:[W1204 13:29:12.496476389 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8842634Z [rank2]:[W1204 13:29:12.525624386 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8842820Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8843077Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8843251Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8843617Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8843821Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8843925Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8844022Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8844118Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8844119Z 2025-12-04T13:44:25.8844353Z [rank2]:[W1204 13:29:12.527728190 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8844522Z [rank1]:[W1204 13:29:13.118111040 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8844697Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8844951Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8845114Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8845481Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8845682Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8845788Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8845882Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8845980Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8845982Z 2025-12-04T13:44:25.8846226Z [rank1]:[W1204 13:29:13.120392959 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8846398Z [rank3]:[W1204 13:29:13.496640862 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8846583Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8846847Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8847020Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8847385Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8847622Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8847727Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8847826Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8847923Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8847925Z 2025-12-04T13:44:25.8848160Z [rank3]:[W1204 13:29:13.498727936 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8851935Z [rank2]:[W1204 13:29:13.527818454 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8852118Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8852377Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8852540Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8852908Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8853111Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8853216Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8853313Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8853410Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8853412Z 2025-12-04T13:44:25.8853681Z [rank2]:[W1204 13:29:13.528971559 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8853851Z [rank1]:[W1204 13:29:14.120558972 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8854027Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8854311Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8854474Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8854858Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8855059Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8855166Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8855261Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8855358Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8855360Z 2025-12-04T13:44:25.8855594Z [rank1]:[W1204 13:29:14.122335653 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8855766Z [rank3]:[W1204 13:29:14.498889329 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8855943Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8856200Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8856366Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8856735Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8856937Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8857042Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8857138Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8857235Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8857237Z 2025-12-04T13:44:25.8857468Z [rank3]:[W1204 13:29:14.501193418 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8857696Z [rank2]:[W1204 13:29:14.529087133 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8857870Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8858125Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8858312Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8858680Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8858900Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8859003Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8859100Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8859195Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8859198Z 2025-12-04T13:44:25.8859432Z [rank2]:[W1204 13:29:14.530797495 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8859603Z [rank1]:[W1204 13:29:15.122497787 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8859780Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8860034Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8860196Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8860563Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8860768Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8860872Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8860967Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8861063Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8861065Z 2025-12-04T13:44:25.8861299Z [rank1]:[W1204 13:29:15.124575491 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8861470Z [rank3]:[W1204 13:29:15.501375961 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8861655Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8861911Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8862084Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8862459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8862674Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8862779Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8862874Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8862972Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8862974Z 2025-12-04T13:44:25.8863207Z [rank3]:[W1204 13:29:15.502852049 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8863379Z [rank2]:[W1204 13:29:15.530941289 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8863555Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8863811Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8863975Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8864343Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8864546Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8864650Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8864746Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8864842Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8864845Z 2025-12-04T13:44:25.8865080Z [rank2]:[W1204 13:29:15.532389757 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8865250Z [rank1]:[W1204 13:29:16.124758984 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8865426Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8865692Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8865854Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8866241Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8866458Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8866564Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8866659Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8866755Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8866758Z 2025-12-04T13:44:25.8866991Z [rank1]:[W1204 13:29:16.127067673 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8867162Z [rank3]:[W1204 13:29:16.503015633 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8867339Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8867634Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8867797Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8868166Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8868369Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8868474Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8868570Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8868669Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8868670Z 2025-12-04T13:44:25.8868905Z [rank3]:[W1204 13:29:16.505324852 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8869076Z [rank2]:[W1204 13:29:16.532521832 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8869250Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8869523Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8869686Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8870070Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8870283Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8870401Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8870498Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8870595Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8870598Z 2025-12-04T13:44:25.8870833Z [rank2]:[W1204 13:29:16.533887782 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8871004Z [rank1]:[W1204 13:29:17.127209388 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8871178Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8871433Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8871597Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8871966Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8872168Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8872273Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8872368Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8872465Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8872467Z 2025-12-04T13:44:25.8872700Z [rank1]:[W1204 13:29:17.129300042 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8872871Z [rank3]:[W1204 13:29:17.505482586 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8873046Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8873299Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8873474Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8873845Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8874066Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8874172Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8874277Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8874374Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8874376Z 2025-12-04T13:44:25.8874607Z [rank3]:[W1204 13:29:17.507747216 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8874778Z [rank2]:[W1204 13:29:17.534046816 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8874953Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8875207Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8875372Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8875740Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8875944Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8876049Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8876146Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8876243Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8876245Z 2025-12-04T13:44:25.8876479Z [rank2]:[W1204 13:29:17.535883336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8876650Z [rank1]:[W1204 13:29:18.129486436 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8876826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8877081Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8877243Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8877655Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8877869Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8877987Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8878082Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8878193Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8878195Z 2025-12-04T13:44:25.8878430Z [rank1]:[W1204 13:29:18.132216576 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8878599Z [rank3]:[W1204 13:29:18.507869812 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8878774Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8879029Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8879193Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8879559Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8879762Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8879867Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8879963Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8880060Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8880063Z 2025-12-04T13:44:25.8880297Z [rank3]:[W1204 13:29:18.509622023 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8880471Z [rank2]:[W1204 13:29:18.536035191 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8880647Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8880903Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8881065Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8881457Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8881659Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8881775Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8881886Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8881982Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8881995Z 2025-12-04T13:44:25.8882229Z [rank2]:[W1204 13:29:18.537285383 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8882399Z [rank1]:[W1204 13:29:19.132363371 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8882575Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8882838Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8883000Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8883367Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8883569Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8883674Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8883769Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8883866Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8883868Z 2025-12-04T13:44:25.8884102Z [rank1]:[W1204 13:29:19.135010863 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8884272Z [rank3]:[W1204 13:29:19.509734499 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8884449Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8884708Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8884872Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8885247Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8885449Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8885564Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8885659Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8885765Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8885767Z 2025-12-04T13:44:25.8885999Z [rank3]:[W1204 13:29:19.511979920 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8886182Z [rank2]:[W1204 13:29:19.537452688 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8886356Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8886611Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8886776Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8887142Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8887346Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8887451Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8887584Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8887680Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8887682Z 2025-12-04T13:44:25.8887915Z [rank2]:[W1204 13:29:19.539577262 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8888086Z [rank1]:[W1204 13:29:20.135159858 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8888262Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8888518Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8888680Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8889049Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8889268Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8889374Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8889468Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8889578Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8889580Z 2025-12-04T13:44:25.8889828Z [rank1]:[W1204 13:29:20.136372651 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8890014Z [rank3]:[W1204 13:29:20.512181984 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8890190Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8890444Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8890609Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8890974Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8891182Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8891287Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8891382Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8891479Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8891481Z 2025-12-04T13:44:25.8891713Z [rank3]:[W1204 13:29:20.513671662 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8891883Z [rank2]:[W1204 13:29:20.539714897 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8892058Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8892314Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8892478Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8892846Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8893049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8893166Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8893261Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8893357Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8893370Z 2025-12-04T13:44:25.8893617Z [rank2]:[W1204 13:29:20.541049728 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8893789Z [rank1]:[W1204 13:29:21.136516737 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8893972Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8894228Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8894390Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8894757Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8894959Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8895063Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8895157Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8895254Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8895256Z 2025-12-04T13:44:25.8895489Z [rank1]:[W1204 13:29:21.137795029 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8895660Z [rank3]:[W1204 13:29:21.513839817 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8895835Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8896090Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8896252Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8896620Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8896823Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8896928Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8897032Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8897129Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8897130Z 2025-12-04T13:44:25.8897362Z [rank3]:[W1204 13:29:21.515444892 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8897601Z [rank2]:[W1204 13:29:21.541222164 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8897777Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8898053Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8898216Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8898582Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8898785Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8898890Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8898986Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8899082Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8899084Z 2025-12-04T13:44:25.8899315Z [rank2]:[W1204 13:29:21.542659832 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8899488Z [rank1]:[W1204 13:29:22.137896707 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8899663Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8899919Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8900085Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8900453Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8900656Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8900761Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8900855Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8902045Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8902048Z 2025-12-04T13:44:25.8902285Z [rank1]:[W1204 13:29:22.139154109 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8902474Z [rank3]:[W1204 13:29:22.515628298 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8902652Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8902908Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8903101Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8903470Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8903673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8903777Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8903874Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8903969Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8903972Z 2025-12-04T13:44:25.8904205Z [rank3]:[W1204 13:29:22.517914237 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8904378Z [rank2]:[W1204 13:29:22.542787439 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8904551Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8904804Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8904970Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8905337Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8905542Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8905646Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8905742Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8905840Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8905842Z 2025-12-04T13:44:25.8906131Z [rank2]:[W1204 13:29:22.544503961 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8906302Z [rank1]:[W1204 13:29:23.139527551 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8906487Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8906743Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8906922Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8907290Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8907544Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8907648Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8907743Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8907842Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8907844Z 2025-12-04T13:44:25.8908078Z [rank1]:[W1204 13:29:23.141124285 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8908248Z [rank3]:[W1204 13:29:23.518033215 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8908424Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8908679Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8908842Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8909208Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8909413Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8909516Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8909612Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8909708Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8909711Z 2025-12-04T13:44:25.8909961Z [rank3]:[W1204 13:29:23.520243596 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8910151Z [rank2]:[W1204 13:29:23.544628128 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8910338Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8910593Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8910755Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8911138Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8911340Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8911445Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8911541Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8911636Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8911639Z 2025-12-04T13:44:25.8911872Z [rank2]:[W1204 13:29:23.546249903 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8912044Z [rank1]:[W1204 13:29:24.141294882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8912221Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8912476Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8912641Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8913011Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8913211Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8913317Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8913411Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8913508Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8913510Z 2025-12-04T13:44:25.8913743Z [rank1]:[W1204 13:29:24.143692859 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8913943Z [rank3]:[W1204 13:29:24.520420622 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8914119Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8914386Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8914550Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8914918Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8915133Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8915237Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8915333Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8915430Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8915432Z 2025-12-04T13:44:25.8915665Z [rank3]:[W1204 13:29:24.522744781 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8915838Z [rank2]:[W1204 13:29:24.546384230 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8916012Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8916266Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8916430Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8916803Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8917007Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8917111Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8917208Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8917305Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8917307Z 2025-12-04T13:44:25.8917578Z [rank2]:[W1204 13:29:24.548801577 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8917750Z [rank1]:[W1204 13:29:25.143823097 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8917954Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8918212Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8918386Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8918754Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8918969Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8919076Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8919172Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8919270Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8919272Z 2025-12-04T13:44:25.8919505Z [rank1]:[W1204 13:29:25.145086529 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8919676Z [rank3]:[W1204 13:29:25.522872459 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8919852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8920110Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8920273Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8920638Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8920842Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8920967Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8921128Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8921296Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8921298Z 2025-12-04T13:44:25.8921559Z [rank3]:[W1204 13:29:25.525291676 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8921731Z [rank2]:[W1204 13:29:25.548939695 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8921906Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8922190Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8922363Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8922729Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8922944Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8923050Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8923146Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8923241Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8923244Z 2025-12-04T13:44:25.8923479Z [rank2]:[W1204 13:29:25.551034398 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8923649Z [rank1]:[W1204 13:29:26.145213067 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8923824Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8924079Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8924240Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8924608Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8924809Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8924914Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8925011Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8925106Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8925109Z 2025-12-04T13:44:25.8925344Z [rank1]:[W1204 13:29:26.146468430 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8925514Z [rank3]:[W1204 13:29:26.525656909 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8925691Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8925968Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8926131Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8926507Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8926707Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8926828Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8926925Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8927020Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8927022Z 2025-12-04T13:44:25.8927253Z [rank3]:[W1204 13:29:26.528034886 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8927426Z [rank2]:[W1204 13:29:26.551179876 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8927638Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8927899Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8928063Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8928429Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8928661Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8928813Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8928944Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8929091Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8929093Z 2025-12-04T13:44:25.8929327Z [rank2]:[W1204 13:29:26.553509195 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8929498Z [rank1]:[W1204 13:29:27.146649847 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8929674Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8929933Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8930133Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8930500Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8930718Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8930826Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8930935Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8931033Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8931035Z 2025-12-04T13:44:25.8931269Z [rank1]:[W1204 13:29:27.148448177 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8931441Z [rank3]:[W1204 13:29:27.528171575 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8931615Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8931894Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8932059Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8932427Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8932629Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8932732Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8932829Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8932924Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8932927Z 2025-12-04T13:44:25.8933162Z [rank3]:[W1204 13:29:27.529412228 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8933331Z [rank2]:[W1204 13:29:27.553678413 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8933506Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8933762Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8933937Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8934318Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8934533Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8934639Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8934734Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8934841Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8934842Z 2025-12-04T13:44:25.8935077Z [rank2]:[W1204 13:29:27.555632290 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8935247Z [rank1]:[W1204 13:29:28.148593406 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8935424Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8935676Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8935840Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8936222Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8936427Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8936532Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8936628Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8936727Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8936729Z 2025-12-04T13:44:25.8936970Z [rank1]:[W1204 13:29:28.149862528 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8937146Z [rank3]:[W1204 13:29:28.529551656 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8937321Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8937614Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8937778Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8938171Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8938373Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8938488Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8938584Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8938678Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8938692Z 2025-12-04T13:44:25.8938930Z [rank3]:[W1204 13:29:28.531732628 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8939101Z [rank2]:[W1204 13:29:28.555771778 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8939276Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8939531Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8939693Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8940065Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8940268Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8940374Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8940469Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8940567Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8940570Z 2025-12-04T13:44:25.8940803Z [rank2]:[W1204 13:29:28.558292553 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8940975Z [rank1]:[W1204 13:29:29.150038196 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8941151Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8941406Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8941569Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8941958Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8942162Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8942302Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8942397Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8942494Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8942496Z 2025-12-04T13:44:25.8942728Z [rank1]:[W1204 13:29:29.152318826 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8942912Z [rank3]:[W1204 13:29:29.531864628 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8943085Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8943348Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8943612Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8944023Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8944227Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8944331Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8944428Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8944524Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8944526Z 2025-12-04T13:44:25.8944764Z [rank3]:[W1204 13:29:29.533906073 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8944935Z [rank2]:[W1204 13:29:29.558417742 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8945110Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8945365Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8945529Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8945905Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8946130Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8946236Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8946344Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8946440Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8946441Z 2025-12-04T13:44:25.8946675Z [rank2]:[W1204 13:29:29.560892018 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8946859Z [rank1]:[W1204 13:29:30.152475135 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8947036Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8947304Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8947468Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8947886Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8948091Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8948196Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8948290Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8948387Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8948390Z 2025-12-04T13:44:25.8948654Z [rank1]:[W1204 13:29:30.154344464 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8948825Z [rank3]:[W1204 13:29:30.534071672 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8948999Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8949264Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8949427Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8949798Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8950016Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8950135Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8950231Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8950338Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8950340Z 2025-12-04T13:44:25.8950574Z [rank3]:[W1204 13:29:30.536300432 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8950746Z [rank2]:[W1204 13:29:30.561070426 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8950939Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8951196Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8951358Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8951726Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8951927Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8952035Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8952132Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8952232Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8952235Z 2025-12-04T13:44:25.8952470Z [rank2]:[W1204 13:29:30.563175080 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8952667Z [rank1]:[W1204 13:29:31.154524022 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8952842Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8953096Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8953258Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8953647Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8953849Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8953965Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8954093Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8954190Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8954203Z 2025-12-04T13:44:25.8954435Z [rank1]:[W1204 13:29:31.156592487 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8954607Z [rank3]:[W1204 13:29:31.536480421 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8954781Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8955048Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8955211Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8955580Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8955782Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8955888Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8955985Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8956081Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8956083Z 2025-12-04T13:44:25.8956336Z [rank3]:[W1204 13:29:31.538629624 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8956508Z [rank2]:[W1204 13:29:31.563352829 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8956681Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8956942Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8957105Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8957518Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8957720Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8957825Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8957921Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8958065Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8958067Z 2025-12-04T13:44:25.8958301Z [rank2]:[W1204 13:29:31.565391714 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8958488Z [rank1]:[W1204 13:29:32.156784606 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8958666Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8958939Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8959105Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8959469Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8959685Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8959790Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8959885Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8959982Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8959985Z 2025-12-04T13:44:25.8960216Z [rank1]:[W1204 13:29:32.159149053 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8960387Z [rank3]:[W1204 13:29:32.538820403 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8960560Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8960815Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8960980Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8961349Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8961551Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8961654Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8961750Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8961845Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8961848Z 2025-12-04T13:44:25.8962104Z [rank3]:[W1204 13:29:32.540816479 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8962287Z [rank2]:[W1204 13:29:32.565519364 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8962461Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8962717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8962892Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8963262Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8963466Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8963571Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8963668Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8963765Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8963767Z 2025-12-04T13:44:25.8964002Z [rank2]:[W1204 13:29:32.567717846 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8964171Z [rank1]:[W1204 13:29:33.159297954 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8964346Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8964601Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8964764Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8965131Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8965332Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8965437Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8965532Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8965630Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8965632Z 2025-12-04T13:44:25.8965887Z [rank1]:[W1204 13:29:33.161237701 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8966057Z [rank3]:[W1204 13:29:33.540975429 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8966242Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8966496Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8966658Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8967039Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8967241Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8967346Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8967443Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8967588Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8967592Z 2025-12-04T13:44:25.8967827Z [rank3]:[W1204 13:29:33.543141831 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8967999Z [rank2]:[W1204 13:29:33.567856446 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8968172Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8968428Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8968591Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8968959Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8969161Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8969267Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8969362Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8969459Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8969461Z 2025-12-04T13:44:25.8969695Z [rank2]:[W1204 13:29:33.570103477 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8969900Z [rank1]:[W1204 13:29:34.161414341 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8970077Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8970346Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8970509Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8970897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8971096Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8971203Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8971297Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8971394Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8971396Z 2025-12-04T13:44:25.8971628Z [rank1]:[W1204 13:29:34.163257240 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8971801Z [rank3]:[W1204 13:29:34.543334091 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8971976Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8972240Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8972405Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8972772Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8972975Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8973079Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8973176Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8973271Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8973273Z 2025-12-04T13:44:25.8973505Z [rank3]:[W1204 13:29:34.545024044 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8973691Z [rank2]:[W1204 13:29:34.570253137 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8973877Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8974132Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8974307Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8974675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8974893Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8974998Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8975097Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8975193Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8975195Z 2025-12-04T13:44:25.8975427Z [rank2]:[W1204 13:29:34.571553589 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8975597Z [rank1]:[W1204 13:29:35.163427081 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8975779Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8976032Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8976195Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8976564Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8976769Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8976874Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8976969Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8977066Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8977068Z 2025-12-04T13:44:25.8977302Z [rank1]:[W1204 13:29:35.165829318 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8977589Z [rank3]:[W1204 13:29:35.545353761 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8977785Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8978053Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8978231Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8978596Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8978819Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8978925Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8979021Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8979118Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8979120Z 2025-12-04T13:44:25.8979352Z [rank3]:[W1204 13:29:35.547803847 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8979522Z [rank2]:[W1204 13:29:35.571716749 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8979697Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8979953Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8980115Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8980481Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8980684Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8980792Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8980888Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8980985Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8980988Z 2025-12-04T13:44:25.8981226Z [rank2]:[W1204 13:29:35.574007709 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8981395Z [rank1]:[W1204 13:29:36.165992809 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8981572Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8981852Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8982016Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8982396Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8982597Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8982713Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8982809Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8982906Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8982907Z 2025-12-04T13:44:25.8983145Z [rank1]:[W1204 13:29:36.167639642 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8983316Z [rank3]:[W1204 13:29:36.547998267 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8983491Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8983749Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8983912Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8984278Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8984479Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8984584Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8984680Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8984777Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8984779Z 2025-12-04T13:44:25.8985011Z [rank3]:[W1204 13:29:36.549647021 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8985183Z [rank2]:[W1204 13:29:36.574153190 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8985360Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8985628Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8985803Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8986170Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8986383Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8986504Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8986600Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8986697Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8986699Z 2025-12-04T13:44:25.8986931Z [rank2]:[W1204 13:29:36.576507328 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8987101Z [rank1]:[W1204 13:29:37.167784374 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8987276Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8987582Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8987747Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8988115Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8988316Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8988421Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8988516Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8988613Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8988615Z 2025-12-04T13:44:25.8988848Z [rank1]:[W1204 13:29:37.169036196 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8989020Z [rank3]:[W1204 13:29:37.549830282 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8989194Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8989449Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8989639Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8990022Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8990240Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8990343Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8990452Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8990548Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8990550Z 2025-12-04T13:44:25.8990784Z [rank3]:[W1204 13:29:37.551993844 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8990955Z [rank2]:[W1204 13:29:37.576601921 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8991129Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8991388Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8991551Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8991923Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8992127Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8992232Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8992329Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8992425Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8992427Z 2025-12-04T13:44:25.8992663Z [rank2]:[W1204 13:29:37.579120006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8992832Z [rank1]:[W1204 13:29:38.169222727 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8993008Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8993263Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8993429Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8993821Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8994032Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8994138Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8994234Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8994342Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8994343Z 2025-12-04T13:44:25.8994577Z [rank1]:[W1204 13:29:38.171154795 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8994747Z [rank3]:[W1204 13:29:38.552237134 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8994924Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.8995178Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8995341Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8995708Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8995909Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8996015Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8996110Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8996206Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8996209Z 2025-12-04T13:44:25.8996442Z [rank3]:[W1204 13:29:38.554505354 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8996613Z [rank2]:[W1204 13:29:38.579258728 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8996788Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8997043Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8997205Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8997659Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8997862Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8997979Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8998077Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8998172Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.8998174Z 2025-12-04T13:44:25.8998419Z [rank2]:[W1204 13:29:38.581799832 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.8998592Z [rank1]:[W1204 13:29:39.171304057 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.8998768Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.8999024Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.8999185Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8999554Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.8999756Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.8999864Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.8999957Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9000054Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9000056Z 2025-12-04T13:44:25.9000290Z [rank1]:[W1204 13:29:39.172562389 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9000463Z [rank3]:[W1204 13:29:39.554653736 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9000638Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9000900Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9001063Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9001439Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9001652Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9001757Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9001868Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9001964Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9001966Z 2025-12-04T13:44:25.9002198Z [rank3]:[W1204 13:29:39.557049543 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9002379Z [rank2]:[W1204 13:29:39.581937315 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9002553Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9002808Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9002973Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9003341Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9003547Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9003651Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9003750Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9003845Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9003847Z 2025-12-04T13:44:25.9004082Z [rank2]:[W1204 13:29:39.584263104 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9004252Z [rank1]:[W1204 13:29:40.172712622 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9004429Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9004684Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9004847Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9005219Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9005440Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9005546Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9005640Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9005749Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9005751Z 2025-12-04T13:44:25.9005985Z [rank1]:[W1204 13:29:40.174210419 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9006154Z [rank3]:[W1204 13:29:40.557231925 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9006342Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9006597Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9006760Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9007126Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9007331Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9007439Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9007564Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9007663Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9007664Z 2025-12-04T13:44:25.9007897Z [rank3]:[W1204 13:29:40.559468436 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9008068Z [rank2]:[W1204 13:29:40.584404966 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9008242Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9008499Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9008662Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9009031Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9009234Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9009353Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9009464Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9009561Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9010115Z 2025-12-04T13:44:25.9010349Z [rank2]:[W1204 13:29:40.586762564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9010518Z [rank1]:[W1204 13:29:41.174356952 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9010711Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9010967Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9011129Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9011495Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9011696Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9011802Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9011898Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9011997Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9011999Z 2025-12-04T13:44:25.9012237Z [rank1]:[W1204 13:29:41.176442866 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9012409Z [rank3]:[W1204 13:29:41.559656438 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9012584Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9012840Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9013003Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9013369Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9013571Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9013677Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9013784Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9013894Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9013896Z 2025-12-04T13:44:25.9014127Z [rank3]:[W1204 13:29:41.561757212 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9014309Z [rank2]:[W1204 13:29:41.586896468 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9014484Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9014750Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9014912Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9015278Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9015481Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9015585Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9015681Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9015778Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9015780Z 2025-12-04T13:44:25.9016013Z [rank2]:[W1204 13:29:41.589192177 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9016183Z [rank1]:[W1204 13:29:42.176543810 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9016357Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9016613Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9016778Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9017143Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9017345Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9017451Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9017584Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9017695Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9017697Z 2025-12-04T13:44:25.9017951Z [rank1]:[W1204 13:29:42.178943007 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9018136Z [rank3]:[W1204 13:29:42.561923355 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9018312Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9018567Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9018746Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9019115Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9019319Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9019424Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9019521Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9019618Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9019620Z 2025-12-04T13:44:25.9019857Z [rank3]:[W1204 13:29:42.563895442 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9020028Z [rank2]:[W1204 13:29:42.589338061 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9020202Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9020458Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9020622Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9020990Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9021194Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9021298Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9021393Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9021490Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9021491Z 2025-12-04T13:44:25.9021748Z [rank2]:[W1204 13:29:42.591589661 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9021918Z [rank1]:[W1204 13:29:43.179058521 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9022105Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9022358Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9022531Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9022902Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9023104Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9023210Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9023304Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9023402Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9023404Z 2025-12-04T13:44:25.9023639Z [rank1]:[W1204 13:29:43.181435869 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9023809Z [rank3]:[W1204 13:29:43.564053435 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9023984Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9024237Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9024399Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9024766Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9024968Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9025073Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9025167Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9025265Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9025269Z 2025-12-04T13:44:25.9025514Z [rank3]:[W1204 13:29:43.565825356 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9025697Z [rank2]:[W1204 13:29:43.591727885 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9025871Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9026137Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9026300Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9026679Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9026883Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9026987Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9027083Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9027179Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9027181Z 2025-12-04T13:44:25.9027424Z [rank2]:[W1204 13:29:43.594176701 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9027640Z [rank1]:[W1204 13:29:44.181590653 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9027813Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9028069Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9028231Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9028598Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9028799Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9028905Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9029000Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9029096Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9029097Z 2025-12-04T13:44:25.9029329Z [rank1]:[W1204 13:29:44.183231857 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9029530Z [rank3]:[W1204 13:29:44.565971220 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9029709Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9029969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9030144Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9030509Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9030723Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9030829Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9030925Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9031022Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9031023Z 2025-12-04T13:44:25.9031255Z [rank3]:[W1204 13:29:44.567962146 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9031425Z [rank2]:[W1204 13:29:44.594309565 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9031600Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9031858Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9032023Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9032390Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9032594Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9032698Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9032796Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9032892Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9032894Z 2025-12-04T13:44:25.9033127Z [rank2]:[W1204 13:29:44.596667623 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9033299Z [rank1]:[W1204 13:29:45.183383961 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9033500Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9033757Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9033928Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9034302Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9034514Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9034621Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9034718Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9034814Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9034816Z 2025-12-04T13:44:25.9035049Z [rank1]:[W1204 13:29:45.184649573 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9035218Z [rank3]:[W1204 13:29:45.568120620 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9035396Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9035651Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9035815Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9036182Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9036385Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9036491Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9036587Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9036684Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9036687Z 2025-12-04T13:44:25.9036918Z [rank3]:[W1204 13:29:45.569401912 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9037089Z [rank2]:[W1204 13:29:45.596819118 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9037264Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9037584Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9037748Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9038129Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9038348Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9038453Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9038550Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9038646Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9038649Z 2025-12-04T13:44:25.9038885Z [rank2]:[W1204 13:29:45.598327095 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9039055Z [rank1]:[W1204 13:29:46.184803278 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9039231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9039488Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9039650Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9040016Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9040217Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9040322Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9040419Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9040514Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9040516Z 2025-12-04T13:44:25.9040751Z [rank1]:[W1204 13:29:46.187019179 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9040918Z [rank3]:[W1204 13:29:46.569557397 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9041095Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9041371Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9041535Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9041900Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9042112Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9042230Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9042326Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9042424Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9042426Z 2025-12-04T13:44:25.9042657Z [rank3]:[W1204 13:29:46.571241140 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9042831Z [rank2]:[W1204 13:29:46.598437390 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9043007Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9043265Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9043430Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9043797Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9044001Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9044106Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9044203Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9044300Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9044302Z 2025-12-04T13:44:25.9044535Z [rank2]:[W1204 13:29:46.600019145 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9044710Z [rank1]:[W1204 13:29:47.187151974 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9044884Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9045142Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9045325Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9045692Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9045903Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9046008Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9046121Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9046217Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9046219Z 2025-12-04T13:44:25.9046454Z [rank1]:[W1204 13:29:47.188586552 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9046625Z [rank3]:[W1204 13:29:47.571390545 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9046802Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9047056Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9047222Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9047628Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9047830Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9047936Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9048031Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9048127Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9048128Z 2025-12-04T13:44:25.9048362Z [rank3]:[W1204 13:29:47.572715446 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9048534Z [rank2]:[W1204 13:29:47.600164501 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9048708Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9048962Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9049128Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9049526Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9049744Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9049848Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9049943Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9050051Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9050053Z 2025-12-04T13:44:25.9050291Z [rank2]:[W1204 13:29:47.602793893 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9050460Z [rank1]:[W1204 13:29:48.188767447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9050635Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9050890Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9051053Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9051422Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9051624Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9051730Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9051826Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9051921Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9051924Z 2025-12-04T13:44:25.9052157Z [rank1]:[W1204 13:29:48.190262914 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9052327Z [rank3]:[W1204 13:29:48.572854241 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9052502Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9052755Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9052919Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9053306Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9053507Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9053622Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9053717Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9053813Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9053827Z 2025-12-04T13:44:25.9054061Z [rank3]:[W1204 13:29:48.574104514 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9054233Z [rank2]:[W1204 13:29:48.602932878 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9054407Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9054663Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9054827Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9055196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9055398Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9055503Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9055599Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9055695Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9055700Z 2025-12-04T13:44:25.9055935Z [rank2]:[W1204 13:29:48.605163119 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9056108Z [rank1]:[W1204 13:29:49.190427779 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9056282Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9056538Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9056699Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9057077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9057290Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9057405Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9057547Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9057643Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9057645Z 2025-12-04T13:44:25.9057879Z [rank1]:[W1204 13:29:49.192463964 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9058064Z [rank3]:[W1204 13:29:49.574282649 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9058241Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9058498Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9058662Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9059027Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9059232Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9059337Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9059432Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9059528Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9059530Z 2025-12-04T13:44:25.9059762Z [rank3]:[W1204 13:29:49.575539141 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9059934Z [rank2]:[W1204 13:29:49.605285615 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9060111Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9060366Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9060532Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9060896Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9061125Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9061230Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9061326Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9061442Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9061447Z 2025-12-04T13:44:25.9061679Z [rank2]:[W1204 13:29:49.607509436 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9061861Z [rank1]:[W1204 13:29:50.192571231 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9062036Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9062292Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9062456Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9062824Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9063028Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9063132Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9063228Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9063324Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9063326Z 2025-12-04T13:44:25.9063559Z [rank1]:[W1204 13:29:50.193744695 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9063728Z [rank3]:[W1204 13:29:50.575686737 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9063908Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9064164Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9064329Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9064702Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9064906Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9065031Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9065126Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9065223Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9065235Z 2025-12-04T13:44:25.9065468Z [rank3]:[W1204 13:29:50.577901308 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9065638Z [rank2]:[W1204 13:29:50.607672212 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9065822Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9066076Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9066241Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9066608Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9066812Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9066917Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9067016Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9067113Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9067117Z 2025-12-04T13:44:25.9067350Z [rank2]:[W1204 13:29:50.609844454 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9067559Z [rank1]:[W1204 13:29:51.193939561 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9067734Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9067991Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9068152Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9068521Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9068723Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9068828Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9068952Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9069048Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9069050Z 2025-12-04T13:44:25.9069285Z [rank1]:[W1204 13:29:51.196845667 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9069467Z [rank3]:[W1204 13:29:51.578038235 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9069643Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9069910Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9070075Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9070439Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9070640Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9070746Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9070842Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9070939Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9070942Z 2025-12-04T13:44:25.9071174Z [rank3]:[W1204 13:29:51.580050161 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9071347Z [rank2]:[W1204 13:29:51.609979381 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9071524Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9071778Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9071943Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9072308Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9072512Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9072619Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9072716Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9072840Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9072842Z 2025-12-04T13:44:25.9073076Z [rank2]:[W1204 13:29:51.612791459 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9073258Z [rank1]:[W1204 13:29:52.196966754 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9073432Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9073690Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9073865Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9074232Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9074434Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9074538Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9074635Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9074729Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9074732Z 2025-12-04T13:44:25.9074966Z [rank1]:[W1204 13:29:52.198968000 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9075137Z [rank3]:[W1204 13:29:52.580202148 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9075312Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9075566Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9075731Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9076097Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9076297Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9076402Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9076497Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9076595Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9076597Z 2025-12-04T13:44:25.9076858Z [rank3]:[W1204 13:29:52.582456218 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9077029Z [rank2]:[W1204 13:29:52.612928836 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9077214Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9077468Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9077681Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9078053Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9078257Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9078361Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9078458Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9078556Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9078558Z 2025-12-04T13:44:25.9078792Z [rank2]:[W1204 13:29:52.615408802 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9078962Z [rank1]:[W1204 13:29:53.199108077 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9079136Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9079390Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9079553Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9079921Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9080125Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9080229Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9080326Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9080422Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9080425Z 2025-12-04T13:44:25.9080672Z [rank1]:[W1204 13:29:53.201017345 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9080857Z [rank3]:[W1204 13:29:53.582621565 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9081046Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9081304Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9081468Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9081847Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9082048Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9082154Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9082250Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9082347Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9082350Z 2025-12-04T13:44:25.9082585Z [rank3]:[W1204 13:29:53.584882435 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9082756Z [rank2]:[W1204 13:29:53.615544569 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9082931Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9083185Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9083350Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9083718Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9083921Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9084025Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9084122Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9084219Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9084221Z 2025-12-04T13:44:25.9084456Z [rank2]:[W1204 13:29:53.617530605 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9084651Z [rank1]:[W1204 13:29:54.201114053 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9084826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9085091Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9085252Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9085618Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9085830Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9085934Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9086032Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9086128Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9086131Z 2025-12-04T13:44:25.9086363Z [rank1]:[W1204 13:29:54.202982352 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9086535Z [rank3]:[W1204 13:29:54.585016342 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9086712Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9086968Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9087131Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9087535Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9087737Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9087843Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9087939Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9088035Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9088037Z 2025-12-04T13:44:25.9088270Z [rank3]:[W1204 13:29:54.586438921 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9088443Z [rank2]:[W1204 13:29:54.617659593 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9088645Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9088903Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9089081Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9089445Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9089663Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9089768Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9089864Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9089963Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9089964Z 2025-12-04T13:44:25.9090197Z [rank2]:[W1204 13:29:54.619174400 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9090368Z [rank1]:[W1204 13:29:55.203087801 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9090543Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9090799Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9090963Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9091335Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9091538Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9091643Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9091739Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9091835Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9091837Z 2025-12-04T13:44:25.9092070Z [rank1]:[W1204 13:29:55.204369102 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9092238Z [rank3]:[W1204 13:29:55.586616818 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9092414Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9092698Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9092871Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9093238Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9093451Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9093558Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9093655Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9093751Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9093754Z 2025-12-04T13:44:25.9093988Z [rank3]:[W1204 13:29:55.588400679 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9094157Z [rank2]:[W1204 13:29:55.619307018 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9094333Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9094589Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9094752Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9095118Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9095321Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9095428Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9095525Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9095623Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9095626Z 2025-12-04T13:44:25.9095862Z [rank2]:[W1204 13:29:55.620520561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9096034Z [rank1]:[W1204 13:29:56.204523550 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9096209Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9096487Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9096649Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9097026Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9097228Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9097343Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9097443Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9097570Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9097572Z 2025-12-04T13:44:25.9097804Z [rank1]:[W1204 13:29:56.205846341 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9097974Z [rank3]:[W1204 13:29:56.588554637 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9098153Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9098410Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9098572Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9098942Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9099143Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9099250Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9099345Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9099444Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9099446Z 2025-12-04T13:44:25.9099683Z [rank3]:[W1204 13:29:56.590529053 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9099854Z [rank2]:[W1204 13:29:56.620652980 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9105303Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9105573Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9105798Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9106168Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9106388Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9106517Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9106615Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9106715Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9106718Z 2025-12-04T13:44:25.9106955Z [rank2]:[W1204 13:29:56.623103995 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9107131Z [rank1]:[W1204 13:29:57.205957050 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9107311Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9107616Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9107784Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9108149Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9108353Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9108458Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9108556Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9108652Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9108655Z 2025-12-04T13:44:25.9108891Z [rank1]:[W1204 13:29:57.208273179 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9109061Z [rank3]:[W1204 13:29:57.590674142 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9109237Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9109497Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9109675Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9110056Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9110273Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9110378Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9110473Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9110592Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9110594Z 2025-12-04T13:44:25.9110829Z [rank3]:[W1204 13:29:57.592733746 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9110999Z [rank2]:[W1204 13:29:57.623226114 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9111176Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9111430Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9111598Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9111967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9112171Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9112276Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9112372Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9112471Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9112473Z 2025-12-04T13:44:25.9112707Z [rank2]:[W1204 13:29:57.625705150 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9112877Z [rank1]:[W1204 13:29:58.208386449 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9113053Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9113310Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9113473Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9113861Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9114063Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9114180Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9114276Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9114371Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9114385Z 2025-12-04T13:44:25.9114620Z [rank1]:[W1204 13:29:58.209981714 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9114791Z [rank3]:[W1204 13:29:58.592871125 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9114967Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9115224Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9115385Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9115760Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9115962Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9116069Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9116163Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9116259Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9116262Z 2025-12-04T13:44:25.9116496Z [rank3]:[W1204 13:29:58.595174795 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9116667Z [rank2]:[W1204 13:29:58.625834299 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9116842Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9117097Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9117260Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9117688Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9117892Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9118011Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9118109Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9118207Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9118209Z 2025-12-04T13:44:25.9118442Z [rank2]:[W1204 13:29:58.627811175 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9118628Z [rank1]:[W1204 13:29:59.210108573 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9118801Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9119058Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9119219Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9119585Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9119789Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9119892Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9119989Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9120085Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9120087Z 2025-12-04T13:44:25.9120320Z [rank1]:[W1204 13:29:59.211366975 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9120492Z [rank3]:[W1204 13:29:59.595315714 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9120670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9120926Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9121089Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9121456Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9121679Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9121785Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9121890Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9121988Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9121989Z 2025-12-04T13:44:25.9122227Z [rank3]:[W1204 13:29:59.597641393 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9122409Z [rank2]:[W1204 13:29:59.627929935 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9122585Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9122840Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9123005Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9123373Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9123577Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9123684Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9123778Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9123876Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9123878Z 2025-12-04T13:44:25.9124110Z [rank2]:[W1204 13:29:59.630236914 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9124281Z [rank1]:[W1204 13:30:00.211440816 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9124456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9124713Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9124876Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9125242Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9125453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9125568Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9125664Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9125776Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9125778Z 2025-12-04T13:44:25.9126011Z [rank1]:[W1204 13:30:00.213937561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9126181Z [rank3]:[W1204 13:30:00.597780122 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9126370Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9126627Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9126791Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9127159Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9127360Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9127467Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9127596Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9127691Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9127694Z 2025-12-04T13:44:25.9127927Z [rank3]:[W1204 13:30:00.599566153 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9128096Z [rank2]:[W1204 13:30:00.630376534 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9128272Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9128526Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9128691Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9129059Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9129262Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9129387Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9129495Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9129593Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9129608Z 2025-12-04T13:44:25.9129840Z [rank2]:[W1204 13:30:00.632558386 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9130011Z [rank1]:[W1204 13:30:01.214077731 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9130185Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9130455Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9130618Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9130986Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9131190Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9131294Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9131391Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9131487Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9131489Z 2025-12-04T13:44:25.9131723Z [rank1]:[W1204 13:30:01.216243163 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9131895Z [rank3]:[W1204 13:30:01.599758562 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9132070Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9132330Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9132494Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9132861Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9133063Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9133169Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9133264Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9133380Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9133383Z 2025-12-04T13:44:25.9133617Z [rank3]:[W1204 13:30:01.601737138 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9133799Z [rank2]:[W1204 13:30:01.633590516 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9133974Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9134239Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9134404Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9134771Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9134975Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9135080Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9135176Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9135275Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9135277Z 2025-12-04T13:44:25.9135511Z [rank2]:[W1204 13:30:01.636077561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9135682Z [rank1]:[W1204 13:30:02.216364883 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9135855Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9136110Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9136277Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9136643Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9136844Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9136947Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9137044Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9137139Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9137152Z 2025-12-04T13:44:25.9137399Z [rank1]:[W1204 13:30:02.218723392 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9137627Z [rank3]:[W1204 13:30:02.601944707 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9137801Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9138057Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9138233Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9138603Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9138804Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9138908Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9139002Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9139100Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9139101Z 2025-12-04T13:44:25.9139336Z [rank3]:[W1204 13:30:02.604052730 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9139506Z [rank2]:[W1204 13:30:02.636199082 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9139685Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9139946Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9140110Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9140480Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9140682Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9140786Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9140880Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9140979Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9140981Z 2025-12-04T13:44:25.9141248Z [rank2]:[W1204 13:30:02.638443483 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9141420Z [rank1]:[W1204 13:30:03.218852652 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9141608Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9141866Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9142038Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9142406Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9142607Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9142710Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9142806Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9142902Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9142905Z 2025-12-04T13:44:25.9143139Z [rank1]:[W1204 13:30:03.221245260 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9143311Z [rank3]:[W1204 13:30:03.604559983 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9143484Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9143742Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9143906Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9144278Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9144480Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9144586Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9144682Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9144777Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9144779Z 2025-12-04T13:44:25.9145012Z [rank3]:[W1204 13:30:03.606557589 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9145203Z [rank2]:[W1204 13:30:03.638564273 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9145380Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9145650Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9145814Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9146196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9146399Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9146504Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9146598Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9146695Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9146697Z 2025-12-04T13:44:25.9146928Z [rank2]:[W1204 13:30:03.640373634 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9147102Z [rank1]:[W1204 13:30:04.221423950 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9147276Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9147571Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9147735Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9148105Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9148308Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9148412Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9148509Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9148605Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9148607Z 2025-12-04T13:44:25.9148841Z [rank1]:[W1204 13:30:04.223927684 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9149028Z [rank3]:[W1204 13:30:04.606698970 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9149217Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9149475Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9149649Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9150015Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9150234Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9150339Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9150436Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9150531Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9150533Z 2025-12-04T13:44:25.9150766Z [rank3]:[W1204 13:30:04.608056510 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9150935Z [rank2]:[W1204 13:30:04.640534354 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9151112Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9151365Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9151529Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9151897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9152100Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9152205Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9152302Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9152400Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9152402Z 2025-12-04T13:44:25.9152637Z [rank2]:[W1204 13:30:04.642754215 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9152810Z [rank1]:[W1204 13:30:05.224104295 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9152994Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9153261Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9153433Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9153798Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9154011Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9154116Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9154212Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9154309Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9154311Z 2025-12-04T13:44:25.9154547Z [rank1]:[W1204 13:30:05.226504692 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9154717Z [rank3]:[W1204 13:30:05.608236900 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9154892Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9155148Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9155310Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9155676Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9155876Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9155982Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9156079Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9156174Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9156176Z 2025-12-04T13:44:25.9156410Z [rank3]:[W1204 13:30:05.609978012 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9156581Z [rank2]:[W1204 13:30:05.642905696 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9156756Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9157034Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9157199Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9157617Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9157817Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9157935Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9158031Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9158129Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9158130Z 2025-12-04T13:44:25.9158363Z [rank2]:[W1204 13:30:05.645088228 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9158533Z [rank1]:[W1204 13:30:06.226621464 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9158706Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9158965Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9159129Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9159495Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9159696Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9159800Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9159897Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9159993Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9159995Z 2025-12-04T13:44:25.9160228Z [rank1]:[W1204 13:30:06.229061930 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9160399Z [rank3]:[W1204 13:30:06.610153473 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9160573Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9160844Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9161020Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9161387Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9161602Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9161717Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9161812Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9161909Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9161911Z 2025-12-04T13:44:25.9162144Z [rank3]:[W1204 13:30:06.611883494 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9162315Z [rank2]:[W1204 13:30:06.645224030 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9162490Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9162745Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9162911Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9163283Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9163484Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9163589Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9163685Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9163784Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9163786Z 2025-12-04T13:44:25.9164019Z [rank2]:[W1204 13:30:06.647839542 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9164190Z [rank1]:[W1204 13:30:07.229226221 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9164364Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9164619Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9164793Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9165176Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9165388Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9165492Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9165599Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9165694Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9165697Z 2025-12-04T13:44:25.9165932Z [rank1]:[W1204 13:30:07.231671027 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9166100Z [rank3]:[W1204 13:30:07.612018266 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9166276Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9166532Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9166695Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9167063Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9167268Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9167372Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9167468Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9167612Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9167614Z 2025-12-04T13:44:25.9167848Z [rank3]:[W1204 13:30:07.613662630 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9168017Z [rank2]:[W1204 13:30:07.647984374 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9168192Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9168446Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9168610Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9169004Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9169219Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9169325Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9169420Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9169538Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9169540Z 2025-12-04T13:44:25.9169779Z [rank2]:[W1204 13:30:07.649805714 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9169953Z [rank1]:[W1204 13:30:08.231773080 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9170129Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9170384Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9170546Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9170916Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9171119Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9171225Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9171321Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9171416Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9171420Z 2025-12-04T13:44:25.9171653Z [rank1]:[W1204 13:30:08.233608570 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9171826Z [rank3]:[W1204 13:30:08.613812502 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9172001Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9172258Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9172419Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9172806Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9173008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9173123Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9173219Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9173315Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9173316Z 2025-12-04T13:44:25.9173562Z [rank3]:[W1204 13:30:08.615160302 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9173733Z [rank2]:[W1204 13:30:08.649919827 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9173910Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9174167Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9174331Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9174701Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9174902Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9175009Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9175104Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9175201Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9175203Z 2025-12-04T13:44:25.9175438Z [rank2]:[W1204 13:30:08.652014430 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9175609Z [rank1]:[W1204 13:30:09.233751942 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9175783Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9176039Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9176206Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9176583Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9176796Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9176901Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9177006Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9177102Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9177104Z 2025-12-04T13:44:25.9177336Z [rank1]:[W1204 13:30:09.236273287 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9177556Z [rank3]:[W1204 13:30:09.615284965 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9177732Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9177989Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9178153Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9178524Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9178729Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9178833Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9178929Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9179025Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9179026Z 2025-12-04T13:44:25.9179259Z [rank3]:[W1204 13:30:09.616610956 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9179428Z [rank2]:[W1204 13:30:09.652177523 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9179604Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9179857Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9180021Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9180387Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9180617Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9180723Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9180818Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9180929Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9180931Z 2025-12-04T13:44:25.9181163Z [rank2]:[W1204 13:30:09.654258607 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9181332Z [rank1]:[W1204 13:30:10.236387470 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9181521Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9181777Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9181941Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9182309Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9182512Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9182616Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9182713Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9182811Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9182813Z 2025-12-04T13:44:25.9183046Z [rank1]:[W1204 13:30:10.238800937 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9183216Z [rank3]:[W1204 13:30:10.616742979 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9183391Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9183647Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9183809Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9184177Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9184380Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9184492Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9184600Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9184698Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9184716Z 2025-12-04T13:44:25.9184951Z [rank3]:[W1204 13:30:10.617993852 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9185122Z [rank2]:[W1204 13:30:10.654359020 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9185307Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9185563Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9185726Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9186095Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9186296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9186401Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9186498Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9186597Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9186599Z 2025-12-04T13:44:25.9186835Z [rank2]:[W1204 13:30:10.656144131 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9187006Z [rank1]:[W1204 13:30:11.238968369 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9187181Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9187437Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9187639Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9188003Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9188205Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9188310Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9188418Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9188528Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9188530Z 2025-12-04T13:44:25.9188763Z [rank1]:[W1204 13:30:11.241340007 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9188949Z [rank3]:[W1204 13:30:11.618148985 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9189122Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9189392Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9189556Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9189923Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9190128Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9190233Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9190328Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9190424Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9190425Z 2025-12-04T13:44:25.9190658Z [rank3]:[W1204 13:30:11.620121711 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9190828Z [rank2]:[W1204 13:30:11.656292044 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9191003Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9191257Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9191423Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9191795Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9191998Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9192103Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9192199Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9192308Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9192310Z 2025-12-04T13:44:25.9192555Z [rank2]:[W1204 13:30:11.658527905 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9192737Z [rank1]:[W1204 13:30:12.241514180 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9192912Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9193165Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9193340Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9193711Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9193914Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9194021Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9194117Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9194214Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9194215Z 2025-12-04T13:44:25.9194448Z [rank1]:[W1204 13:30:12.243497356 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9194620Z [rank3]:[W1204 13:30:12.620273584 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9194794Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9195049Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9195211Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9195578Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9195781Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9195886Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9195982Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9196078Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9196079Z 2025-12-04T13:44:25.9196334Z [rank3]:[W1204 13:30:12.621930028 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9196504Z [rank2]:[W1204 13:30:12.658677418 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9196689Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9196943Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9197115Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9197522Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9197724Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9197829Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9197924Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9198022Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9198025Z 2025-12-04T13:44:25.9198266Z [rank2]:[W1204 13:30:12.660811511 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9198435Z [rank1]:[W1204 13:30:13.243590871 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9198610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9198864Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9199030Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9199398Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9199600Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9199705Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9199799Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9199895Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9199898Z 2025-12-04T13:44:25.9200145Z [rank1]:[W1204 13:30:13.246076066 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9200336Z [rank3]:[W1204 13:30:13.622041782 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9200511Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9200784Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9200947Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9201329Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9201530Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9201634Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9201730Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9201825Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9201826Z 2025-12-04T13:44:25.9202061Z [rank3]:[W1204 13:30:13.623929051 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9202232Z [rank2]:[W1204 13:30:13.660965985 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9202408Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9202669Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9202832Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9203202Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9203403Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9203510Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9203605Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9203703Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9203705Z 2025-12-04T13:44:25.9203938Z [rank2]:[W1204 13:30:13.663246264 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9204128Z [rank1]:[W1204 13:30:14.246221820 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9204304Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9204558Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9204733Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9205104Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9205318Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9205422Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9205518Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9205614Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9205615Z 2025-12-04T13:44:25.9205847Z [rank1]:[W1204 13:30:14.248711295 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9206019Z [rank3]:[W1204 13:30:14.624074825 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9206193Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9206454Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9206619Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9206987Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9207190Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9207295Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9207392Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9207525Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9207527Z 2025-12-04T13:44:25.9207760Z [rank3]:[W1204 13:30:14.625915044 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9207930Z [rank2]:[W1204 13:30:14.663365239 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9208132Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9208389Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9208564Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9208935Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9209151Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9209256Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9209353Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9209451Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9209453Z 2025-12-04T13:44:25.9209686Z [rank2]:[W1204 13:30:14.665583050 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9209856Z [rank1]:[W1204 13:30:15.248876599 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9210032Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9210285Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9210449Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9210817Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9211021Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9211127Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9211222Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9211318Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9211320Z 2025-12-04T13:44:25.9211552Z [rank1]:[W1204 13:30:15.251285196 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9211720Z [rank3]:[W1204 13:30:15.626081928 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9211896Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9212171Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9212334Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9212713Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9212927Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9213031Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9213127Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9213222Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9213225Z 2025-12-04T13:44:25.9213460Z [rank3]:[W1204 13:30:15.627333590 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9213629Z [rank2]:[W1204 13:30:15.665861312 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9213806Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9214062Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9214224Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9214594Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9214797Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9214906Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9215002Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9215101Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9215103Z 2025-12-04T13:44:25.9215339Z [rank2]:[W1204 13:30:15.668006574 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9215506Z [rank1]:[W1204 13:30:16.251429550 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9215680Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9215968Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9216132Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9216497Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9216709Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9216829Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9216925Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9217023Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9217025Z 2025-12-04T13:44:25.9217257Z [rank1]:[W1204 13:30:16.253898316 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9217428Z [rank3]:[W1204 13:30:16.627499565 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9217657Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9217915Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9218081Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9218447Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9218651Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9218754Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9218850Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9218947Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9218948Z 2025-12-04T13:44:25.9219187Z [rank3]:[W1204 13:30:16.629704726 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9219358Z [rank2]:[W1204 13:30:16.668160069 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9219533Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9219788Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9219983Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9220349Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9220564Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9220668Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9220778Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9220878Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9220880Z 2025-12-04T13:44:25.9221114Z [rank2]:[W1204 13:30:16.670296232 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9221286Z [rank1]:[W1204 13:30:17.254033991 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9221461Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9221716Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9221881Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9222249Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9222452Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9222556Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9222651Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9222747Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9222749Z 2025-12-04T13:44:25.9222983Z [rank1]:[W1204 13:30:17.256473497 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9223158Z [rank3]:[W1204 13:30:17.629830511 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9223332Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9223588Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9223752Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9224141Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9224353Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9224457Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9224552Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9224658Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9224660Z 2025-12-04T13:44:25.9224895Z [rank3]:[W1204 13:30:17.631528904 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9225066Z [rank2]:[W1204 13:30:17.670481996 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9225242Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9225499Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9225664Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9226036Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9226239Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9226344Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9226439Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9226536Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9226539Z 2025-12-04T13:44:25.9226773Z [rank2]:[W1204 13:30:17.672797145 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9226943Z [rank1]:[W1204 13:30:18.256586123 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9227118Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9227374Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9227578Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9227973Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9228175Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9228294Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9228388Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9228483Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9228506Z 2025-12-04T13:44:25.9228739Z [rank1]:[W1204 13:30:18.258976190 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9228910Z [rank3]:[W1204 13:30:18.631684039 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9229085Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9229343Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9229505Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9229876Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9230079Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9230185Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9230280Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9230375Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9230377Z 2025-12-04T13:44:25.9230613Z [rank3]:[W1204 13:30:18.633758403 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9230784Z [rank2]:[W1204 13:30:18.672965720 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9230958Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9231214Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9231377Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9231758Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9231972Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9232088Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9232183Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9232279Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9232281Z 2025-12-04T13:44:25.9232514Z [rank2]:[W1204 13:30:18.675266029 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9232695Z [rank1]:[W1204 13:30:19.259093947 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9232871Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9233124Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9233288Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9233654Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9233857Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9233963Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9234059Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9234157Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9234159Z 2025-12-04T13:44:25.9234393Z [rank1]:[W1204 13:30:19.261429585 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9234564Z [rank3]:[W1204 13:30:19.633851870 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9234739Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9234995Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9235166Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9235531Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9235757Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9235866Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9235976Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9236071Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9236073Z 2025-12-04T13:44:25.9236310Z [rank3]:[W1204 13:30:19.635051704 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9236492Z [rank2]:[W1204 13:30:19.675416064 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9236667Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9236924Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9237089Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9237457Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9237679Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9237785Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9237881Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9237978Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9237980Z 2025-12-04T13:44:25.9238214Z [rank2]:[W1204 13:30:19.677746733 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9238385Z [rank1]:[W1204 13:30:20.261526042 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9238563Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9238818Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9238981Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9239346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9239547Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9239682Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9239777Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9239873Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9239889Z 2025-12-04T13:44:25.9240121Z [rank1]:[W1204 13:30:20.263936269 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9240291Z [rank3]:[W1204 13:30:20.635191370 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9240478Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9240739Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9240903Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9241269Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9241472Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9241576Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9241673Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9241769Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9241772Z 2025-12-04T13:44:25.9242007Z [rank3]:[W1204 13:30:20.637098028 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9242177Z [rank2]:[W1204 13:30:20.677889089 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9242352Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9242611Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9242775Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9243144Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9243345Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9243458Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9243581Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9243678Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9243681Z 2025-12-04T13:44:25.9243915Z [rank2]:[W1204 13:30:20.679888425 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9244102Z [rank1]:[W1204 13:30:21.264106344 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9244277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9244543Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9244707Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9245077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9245278Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9245383Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9245479Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9245576Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9245578Z 2025-12-04T13:44:25.9245810Z [rank1]:[W1204 13:30:21.266347995 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9245982Z [rank3]:[W1204 13:30:21.637260954 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9246155Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9246413Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9246577Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9246943Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9247147Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9247251Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9247347Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9247461Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9247463Z 2025-12-04T13:44:25.9247738Z [rank3]:[W1204 13:30:21.639949814 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9247926Z [rank2]:[W1204 13:30:21.680025482 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9248101Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9248356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9248536Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9248905Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9249108Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9249217Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9249314Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9249410Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9249413Z 2025-12-04T13:44:25.9249648Z [rank2]:[W1204 13:30:21.682308801 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9249818Z [rank1]:[W1204 13:30:22.266521951 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9249992Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9250244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9250409Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9250775Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9250977Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9251082Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9251178Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9251274Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9251276Z 2025-12-04T13:44:25.9251537Z [rank1]:[W1204 13:30:22.268751002 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9251710Z [rank3]:[W1204 13:30:22.640064902 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9251895Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9252155Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9252329Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9252696Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9252898Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9253002Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9253097Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9253193Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9253196Z 2025-12-04T13:44:25.9253431Z [rank3]:[W1204 13:30:22.641765034 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9253602Z [rank2]:[W1204 13:30:22.682452368 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9253777Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9254033Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9254196Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9254567Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9254768Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9254872Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9254970Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9255066Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9255069Z 2025-12-04T13:44:25.9255324Z [rank2]:[W1204 13:30:22.684496453 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9255494Z [rank1]:[W1204 13:30:23.268860519 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9255679Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9255940Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9256105Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9256486Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9256687Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9256794Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9256888Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9256985Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9256988Z 2025-12-04T13:44:25.9257220Z [rank1]:[W1204 13:30:23.271502061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9257393Z [rank3]:[W1204 13:30:23.641913631 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9257612Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9257867Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9258042Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9258419Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9258624Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9258729Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9258824Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9258920Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9258924Z 2025-12-04T13:44:25.9259160Z [rank3]:[W1204 13:30:23.643162634 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9259363Z [rank2]:[W1204 13:30:23.684672499 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9259539Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9259812Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9259975Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9260346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9260565Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9260670Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9260765Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9260861Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9260863Z 2025-12-04T13:44:25.9261096Z [rank2]:[W1204 13:30:23.686797582 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9261267Z [rank1]:[W1204 13:30:24.271629829 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9261444Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9261697Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9261861Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9262233Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9262436Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9262545Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9262640Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9262737Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9262739Z 2025-12-04T13:44:25.9262971Z [rank1]:[W1204 13:30:24.274283800 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9263144Z [rank3]:[W1204 13:30:24.643271452 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9263340Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9263596Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9263770Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9264135Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9264349Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9264455Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9264556Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9264653Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9264656Z 2025-12-04T13:44:25.9264888Z [rank3]:[W1204 13:30:24.645055852 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9265061Z [rank2]:[W1204 13:30:24.686935870 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9265238Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9265493Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9265658Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9266029Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9266238Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9266344Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9266440Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9266537Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9266539Z 2025-12-04T13:44:25.9266773Z [rank2]:[W1204 13:30:24.688484695 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9266943Z [rank1]:[W1204 13:30:25.274440707 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9267120Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9267393Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9267595Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9267962Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9268179Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9268285Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9268381Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9268485Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9268488Z 2025-12-04T13:44:25.9268721Z [rank1]:[W1204 13:30:25.275674900 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9268892Z [rank3]:[W1204 13:30:25.645193070 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9269069Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9269325Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9269488Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9269856Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9270060Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9270164Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9270261Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9270360Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9270363Z 2025-12-04T13:44:25.9270598Z [rank3]:[W1204 13:30:25.647022380 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9270770Z [rank2]:[W1204 13:30:25.688637753 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9270948Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9271230Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9271393Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9271775Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9271977Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9272090Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9272188Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9272284Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9272286Z 2025-12-04T13:44:25.9272518Z [rank2]:[W1204 13:30:25.690825225 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9272688Z [rank1]:[W1204 13:30:26.275831227 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9272862Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9273121Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9273288Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9273653Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9273855Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9273960Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9274055Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9274153Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9274155Z 2025-12-04T13:44:25.9274389Z [rank1]:[W1204 13:30:26.278291943 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9274562Z [rank3]:[W1204 13:30:26.647330314 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9274736Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9274992Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9275184Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9275555Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9275768Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9275882Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9275977Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9276073Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9276076Z 2025-12-04T13:44:25.9276307Z [rank3]:[W1204 13:30:26.649351139 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9276478Z [rank2]:[W1204 13:30:26.690983872 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9276651Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9276906Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9277072Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9277439Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9277674Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9277779Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9277875Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9277971Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9277973Z 2025-12-04T13:44:25.9278208Z [rank2]:[W1204 13:30:26.693794130 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9278379Z [rank1]:[W1204 13:30:27.278418362 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9278553Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9278808Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9278985Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9279366Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9279579Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9279685Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9279781Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9279893Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9279895Z 2025-12-04T13:44:25.9280129Z [rank1]:[W1204 13:30:27.280868238 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9280300Z [rank3]:[W1204 13:30:27.649466878 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9280476Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9280729Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9280894Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9281262Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9281464Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9281568Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9281664Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9281761Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9281763Z 2025-12-04T13:44:25.9281999Z [rank3]:[W1204 13:30:27.651541362 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9282170Z [rank2]:[W1204 13:30:27.693934888 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9282344Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9282600Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9282763Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9283151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9283363Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9283467Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9283564Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9283660Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9283673Z 2025-12-04T13:44:25.9283907Z [rank2]:[W1204 13:30:27.696179669 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9284079Z [rank1]:[W1204 13:30:28.281017546 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9284257Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9284513Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9284676Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9285045Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9285245Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9285352Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9285447Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9285544Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9285546Z 2025-12-04T13:44:25.9285779Z [rank1]:[W1204 13:30:28.283477852 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9285951Z [rank3]:[W1204 13:30:28.651654482 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9286125Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9286384Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9286548Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9286938Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9287140Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9287256Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9287352Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9287450Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9287451Z 2025-12-04T13:44:25.9287722Z [rank3]:[W1204 13:30:28.653597949 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9287915Z [rank2]:[W1204 13:30:28.696330747 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9288090Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9288348Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9288512Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9288884Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9289089Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9289192Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9289290Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9289386Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9289387Z 2025-12-04T13:44:25.9289620Z [rank2]:[W1204 13:30:28.698601857 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9289791Z [rank1]:[W1204 13:30:29.283640710 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9289967Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9290222Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9290385Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9290754Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9290984Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9291091Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9291198Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9291295Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9291296Z 2025-12-04T13:44:25.9291530Z [rank1]:[W1204 13:30:29.285918860 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9291710Z [rank3]:[W1204 13:30:29.653702068 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9291887Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9292144Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9292308Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9292678Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9292886Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9292991Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9293086Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9293185Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9293187Z 2025-12-04T13:44:25.9293419Z [rank3]:[W1204 13:30:29.655714704 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9293593Z [rank2]:[W1204 13:30:29.698739436 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9293767Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9294025Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9294188Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9294557Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9294773Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9294890Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9294990Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9295097Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9295099Z 2025-12-04T13:44:25.9295331Z [rank2]:[W1204 13:30:29.700706243 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9295501Z [rank1]:[W1204 13:30:30.286108098 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9295688Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9295945Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9296108Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9296473Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9296675Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9296781Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9296876Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9296976Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9296979Z 2025-12-04T13:44:25.9297216Z [rank1]:[W1204 13:30:30.287721032 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9297386Z [rank3]:[W1204 13:30:30.655844183 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9297600Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9297856Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9298019Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9298388Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9298592Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9298710Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9298819Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9298918Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9298935Z 2025-12-04T13:44:25.9299169Z [rank3]:[W1204 13:30:30.658251020 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9299341Z [rank2]:[W1204 13:30:30.700881701 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9299528Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9299786Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9299949Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9300318Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9300520Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9300626Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9300726Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9300822Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9300824Z 2025-12-04T13:44:25.9301060Z [rank2]:[W1204 13:30:30.702945336 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9301231Z [rank1]:[W1204 13:30:31.287881511 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9301405Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9301661Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9301823Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9302193Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9302395Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9302502Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9302597Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9302712Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9302714Z 2025-12-04T13:44:25.9302950Z [rank1]:[W1204 13:30:31.289496766 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9303136Z [rank3]:[W1204 13:30:31.658386120 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9303314Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9303581Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9303746Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9304112Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9304316Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9304422Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9304518Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9304615Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9304618Z 2025-12-04T13:44:25.9304850Z [rank3]:[W1204 13:30:31.660449385 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9305022Z [rank2]:[W1204 13:30:31.703098955 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9305197Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9305455Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9305622Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9305992Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9306196Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9306301Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9306399Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9306495Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9306507Z 2025-12-04T13:44:25.9306751Z [rank2]:[W1204 13:30:31.705082521 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9306931Z [rank1]:[W1204 13:30:32.289642695 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9307107Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9307361Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9307567Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9307939Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9308142Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9308247Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9308342Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9308439Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9308441Z 2025-12-04T13:44:25.9308676Z [rank1]:[W1204 13:30:32.290914427 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9308848Z [rank3]:[W1204 13:30:32.660590994 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9309024Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9309279Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9309443Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9309814Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9310020Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9310124Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9310219Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9310316Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9310318Z 2025-12-04T13:44:25.9310578Z [rank3]:[W1204 13:30:32.662355586 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9310752Z [rank2]:[W1204 13:30:32.705258630 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9310941Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9311196Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9311372Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9311739Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9311942Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9312048Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9312146Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9312242Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9312245Z 2025-12-04T13:44:25.9312481Z [rank2]:[W1204 13:30:32.707618759 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9312652Z [rank1]:[W1204 13:30:33.291085097 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9312827Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9313083Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9313245Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9313615Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9313817Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9313922Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9314017Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9314113Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9314116Z 2025-12-04T13:44:25.9314354Z [rank1]:[W1204 13:30:33.292681251 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9314541Z [rank3]:[W1204 13:30:33.662472626 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9314718Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9314984Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9315147Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9315526Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9315728Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9315834Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9315929Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9316024Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9316026Z 2025-12-04T13:44:25.9316257Z [rank3]:[W1204 13:30:33.663620311 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9316433Z [rank2]:[W1204 13:30:33.707789888 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9316610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9316868Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9317033Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9317400Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9317634Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9317739Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9317835Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9317931Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9317933Z 2025-12-04T13:44:25.9318167Z [rank2]:[W1204 13:30:33.709619288 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9318355Z [rank1]:[W1204 13:30:34.292841941 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9318547Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9318810Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9318986Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9319356Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9319571Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9319678Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9319774Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9319871Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9319874Z 2025-12-04T13:44:25.9320108Z [rank1]:[W1204 13:30:34.294100164 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9320279Z [rank3]:[W1204 13:30:34.663726462 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9320455Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9320710Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9320876Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9321244Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9321451Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9321555Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9321651Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9321748Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9321750Z 2025-12-04T13:44:25.9321982Z [rank3]:[W1204 13:30:34.665344136 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9322153Z [rank2]:[W1204 13:30:34.709781638 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9322337Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9322604Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9322781Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9323150Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9323365Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9323470Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9323568Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9323665Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9323667Z 2025-12-04T13:44:25.9323901Z [rank2]:[W1204 13:30:34.712165685 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9323948Z PASSED [198.6648s] [ 5%] 2025-12-04T13:44:25.9324250Z distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline I1204 13:30:35.199000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 73639 2025-12-04T13:44:25.9324404Z I1204 13:30:35.200000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 73640 2025-12-04T13:44:25.9324554Z I1204 13:30:35.200000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 73641 2025-12-04T13:44:25.9324702Z I1204 13:30:35.201000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 73642 2025-12-04T13:44:25.9324874Z [rank1]:[W1204 13:30:35.294246824 
TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9325051Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9325308Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9325478Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9325850Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9326051Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9326159Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9326255Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9326378Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9326381Z 2025-12-04T13:44:25.9326613Z [rank1]:[W1204 13:30:35.295480607 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9326797Z [rank3]:[W1204 13:30:35.665513056 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9326972Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9327229Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9327406Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9327815Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9328019Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9328124Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9328222Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9328318Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9328321Z 2025-12-04T13:44:25.9328557Z [rank3]:[W1204 13:30:35.667262928 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9328731Z [rank2]:[W1204 13:30:35.712327645 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9328903Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9329158Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9329322Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9329692Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9329895Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9330001Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9330097Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9330194Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9330196Z 2025-12-04T13:44:25.9330458Z [rank2]:[W1204 13:30:35.714434649 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9330628Z [rank1]:[W1204 13:30:36.295631288 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9330815Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9331070Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9331249Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9331620Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9331822Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9331928Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9332023Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9332119Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9332121Z 2025-12-04T13:44:25.9332354Z [rank1]:[W1204 13:30:36.296897670 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9332525Z [rank3]:[W1204 13:30:36.667398639 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9332700Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9332957Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9333123Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9333490Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9333694Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9333798Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9333894Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9333990Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9333994Z 2025-12-04T13:44:25.9334251Z [rank3]:[W1204 13:30:36.668638262 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9334435Z [rank2]:[W1204 13:30:36.714604359 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9334618Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9334873Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9335036Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9335416Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9335619Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9335724Z frame #4: + 0xdc253 (0x78d42d848253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9335821Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9335918Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9335921Z 2025-12-04T13:44:25.9336158Z [rank2]:[W1204 13:30:36.716478548 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9336328Z [rank1]:[W1204 13:30:37.297050541 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9336504Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9336758Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9336923Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9337293Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9337536Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9337644Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9337739Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9337836Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:25.9337838Z 2025-12-04T13:44:25.9338075Z [rank1]:[W1204 13:30:37.298357882 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9338273Z [rank3]:[W1204 13:30:37.668801343 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9338449Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9338717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9338880Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9339244Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9339465Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9339570Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9339667Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9339762Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9339766Z 2025-12-04T13:44:25.9339997Z [rank3]:[W1204 13:30:37.670008366 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:25.9340171Z [rank2]:[W1204 13:30:37.716624859 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9340348Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9340602Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9340765Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9341133Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9341337Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9341440Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9341538Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9341634Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9341636Z 2025-12-04T13:44:25.9341870Z [rank2]:[W1204 13:30:37.718471689 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9342042Z [rank1]:[W1204 13:30:38.298508903 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9342235Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9342492Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9342667Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9343035Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9343247Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9343352Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9343446Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9343543Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9343545Z 2025-12-04T13:44:25.9343777Z [rank1]:[W1204 13:30:38.299764695 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9343948Z [rank3]:[W1204 13:30:38.670170907 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9344125Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9344379Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9344544Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9344912Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9345115Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9345220Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9345317Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9345414Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9345417Z 2025-12-04T13:44:25.9345648Z [rank3]:[W1204 13:30:38.671391550 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9345818Z [rank2]:[W1204 13:30:38.718613340 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9345992Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9346269Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9346442Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.9346816Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9347036Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9347140Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9347237Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9347333Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9347335Z 2025-12-04T13:44:25.9347605Z [rank2]:[W1204 13:30:38.719821094 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9347775Z [rank1]:[W1204 13:30:39.299907007 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9347951Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9348207Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9348369Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9348739Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:25.9348943Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9349050Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9349146Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9349242Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9349244Z 2025-12-04T13:44:25.9349477Z [rank1]:[W1204 13:30:39.301297247 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9349647Z [rank3]:[W1204 13:30:39.671566442 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9349822Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9350105Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9350271Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9350651Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9350856Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9350974Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9351072Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9351168Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9351170Z 2025-12-04T13:44:25.9351406Z [rank3]:[W1204 13:30:39.673042269 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9351577Z [rank2]:[W1204 13:30:39.719981485 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9351749Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9352008Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9352171Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9352538Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9352742Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9352847Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9356860Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9356965Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9356968Z 2025-12-04T13:44:25.9357213Z [rank2]:[W1204 13:30:39.721795725 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9357386Z [rank1]:[W1204 13:30:40.303650030 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9357601Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9357859Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9358059Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9358426Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9358643Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9358764Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9358859Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9358958Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9358961Z 2025-12-04T13:44:25.9359195Z [rank1]:[W1204 13:30:40.305128117 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9359368Z [rank3]:[W1204 13:30:40.673213781 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9359546Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9359804Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9359971Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9360335Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9360540Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9360644Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9360743Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9360839Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9360844Z 2025-12-04T13:44:25.9361078Z [rank3]:[W1204 13:30:40.674450753 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9361251Z [rank2]:[W1204 13:30:40.721962157 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.9361426Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9361682Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9361858Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9362237Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9362450Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9362553Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9362649Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9362760Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9362762Z 2025-12-04T13:44:25.9362999Z [rank2]:[W1204 13:30:40.723837736 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9363168Z [rank1]:[W1204 13:30:41.305280350 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9363346Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9363602Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9363766Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9364138Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9364339Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9364445Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9364539Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9364637Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9364639Z 2025-12-04T13:44:25.9364872Z [rank1]:[W1204 13:30:41.306693608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9365041Z [rank3]:[W1204 13:30:41.674580416 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9365217Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9365472Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9365636Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9366027Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9366230Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9366352Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9366449Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9366545Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9366558Z 2025-12-04T13:44:25.9366793Z [rank3]:[W1204 13:30:41.676784417 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9366966Z [rank2]:[W1204 13:30:41.724024957 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9367139Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9367396Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9367588Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9367958Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9368162Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9368268Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9368368Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9368464Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9368467Z 2025-12-04T13:44:25.9368701Z [rank2]:[W1204 13:30:41.726132321 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9368871Z [rank1]:[W1204 13:30:42.306866501 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9369046Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9369300Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9369462Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9369858Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9370061Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9370180Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9370274Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9370370Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9370372Z 2025-12-04T13:44:25.9370602Z [rank1]:[W1204 13:30:42.309072922 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9370787Z [rank3]:[W1204 13:30:42.676975959 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9370964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9371221Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9371384Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9371749Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9371951Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9372055Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9372152Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9372249Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9372251Z 2025-12-04T13:44:25.9372485Z [rank3]:[W1204 13:30:42.679169051 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9372656Z [rank2]:[W1204 13:30:42.726289183 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9372831Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9373087Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9373249Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9373619Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9373842Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9373948Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9374056Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9374152Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9374154Z 2025-12-04T13:44:25.9374387Z [rank2]:[W1204 13:30:42.728204791 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9374569Z [rank1]:[W1204 13:30:43.309194455 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9374746Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9375000Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9375165Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9375532Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9375735Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9375841Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9375934Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9376031Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9376033Z 2025-12-04T13:44:25.9376264Z [rank1]:[W1204 13:30:43.311616692 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9376435Z [rank3]:[W1204 13:30:43.679297964 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9376612Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9376871Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9377033Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9377398Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9377653Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9377769Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9377866Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9377975Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9377977Z 2025-12-04T13:44:25.9378210Z [rank3]:[W1204 13:30:43.681763660 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9378379Z [rank2]:[W1204 13:30:43.728374493 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9378569Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9378828Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9378993Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9379362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9379569Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9379674Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9379770Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9379865Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9379868Z 2025-12-04T13:44:25.9380102Z [rank2]:[W1204 13:30:43.730495187 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9380272Z [rank1]:[W1204 13:30:44.311793335 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9380449Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9380705Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9380867Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9381239Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9381439Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9381554Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9381665Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9381761Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9381774Z 2025-12-04T13:44:25.9382007Z [rank1]:[W1204 13:30:44.314288190 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9382216Z [rank1]:W1204 13:30:44.658000 73640 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.9382402Z [rank3]:[W1204 13:30:44.681892484 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9382576Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9382834Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9382997Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9383372Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9383577Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9383682Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9383778Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9383874Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9383876Z 2025-12-04T13:44:25.9384110Z [rank3]:[W1204 13:30:44.683237444 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9384281Z [rank2]:[W1204 13:30:44.730650250 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9384456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9384713Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9384877Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9385246Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9385448Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9385572Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9385668Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9385765Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9385777Z 2025-12-04T13:44:25.9386010Z [rank2]:[W1204 13:30:44.732436991 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9386182Z [rank1]:[W1204 13:30:45.314473902 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9386367Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9386622Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9386785Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9387152Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9387353Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9387459Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9387596Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9387693Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9387697Z 2025-12-04T13:44:25.9387929Z [rank1]:[W1204 13:30:45.316933158 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9388099Z [rank3]:[W1204 13:30:45.683359128 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9388273Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9388532Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9388696Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9389065Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9389269Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9389374Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9389487Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9389597Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9389599Z 2025-12-04T13:44:25.9389834Z [rank3]:[W1204 13:30:45.685291186 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9390017Z [rank2]:[W1204 13:30:45.732583804 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9390193Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9390462Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9390626Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9390994Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9391194Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9391300Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9391394Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9391493Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9391495Z 2025-12-04T13:44:25.9391727Z [rank2]:[W1204 13:30:45.733817037 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9391899Z [rank1]:[W1204 13:30:46.317125291 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9392076Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9392332Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9392494Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9392861Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9393064Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9393168Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9393265Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9393372Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9393384Z 2025-12-04T13:44:25.9393618Z [rank1]:[W1204 13:30:46.319395931 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9393799Z [rank3]:[W1204 13:30:46.685481048 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9393973Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9394233Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9394409Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9394777Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9394979Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9395083Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9395179Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9395274Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9395276Z 2025-12-04T13:44:25.9395511Z [rank3]:[W1204 13:30:46.687275239 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9395680Z [rank2]:[W1204 13:30:46.733960551 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9395857Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9396110Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9396276Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9396647Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9396849Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9396956Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9397050Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9397150Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9397153Z 2025-12-04T13:44:25.9397413Z [rank2]:[W1204 13:30:46.736455976 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9397600Z [rank1]:[W1204 13:30:47.319558514 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9397792Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9398046Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9398223Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9398594Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9398796Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9398899Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9398994Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9399092Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9399094Z 2025-12-04T13:44:25.9399327Z [rank1]:[W1204 13:30:47.320828596 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9399497Z [rank3]:[W1204 13:30:47.687415953 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9399671Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9399926Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9400090Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9400464Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9400666Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9400769Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9400864Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9400959Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9400962Z 2025-12-04T13:44:25.9401210Z [rank3]:[W1204 13:30:47.689723172 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9401393Z [rank2]:[W1204 13:30:47.736601330 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9401570Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9401835Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9401998Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9402378Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9402580Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9402687Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9402781Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9402879Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9402882Z 2025-12-04T13:44:25.9403114Z [rank2]:[W1204 13:30:47.738853930 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9403285Z [rank1]:[W1204 13:30:48.321014010 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9403459Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9403716Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9403879Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9404247Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9404449Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9404553Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9404649Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9404746Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9404747Z 2025-12-04T13:44:25.9404980Z [rank1]:[W1204 13:30:48.323448156 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9405171Z [rank3]:[W1204 13:30:48.689864667 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9405345Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9405614Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9405775Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9406146Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9406359Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9406462Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9406558Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9406653Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9406655Z 2025-12-04T13:44:25.9406895Z [rank3]:[W1204 13:30:48.692340242 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9407067Z [rank2]:[W1204 13:30:48.738991675 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9407243Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9407542Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9407705Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9408072Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9408276Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9408380Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9408477Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9408574Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9408576Z 2025-12-04T13:44:25.9408809Z [rank2]:[W1204 13:30:48.740201908 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9408981Z [rank1]:[W1204 13:30:49.323632960 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9409186Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9409441Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9409619Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9409983Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9410204Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9410308Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9410403Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9410500Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9410501Z 2025-12-04T13:44:25.9410733Z [rank1]:[W1204 13:30:49.326011218 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9410904Z [rank3]:[W1204 13:30:49.692477887 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9411079Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9411339Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9411503Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9411869Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9412072Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9412176Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9412272Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9412368Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9412369Z 2025-12-04T13:44:25.9412602Z [rank3]:[W1204 13:30:49.696148167 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9412771Z [rank2]:[W1204 13:30:49.740372023 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9412946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9413221Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9413384Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9413763Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9413974Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9414081Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9414176Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9414272Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9414275Z 2025-12-04T13:44:25.9414507Z [rank2]:[W1204 13:30:49.742652072 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9414676Z [rank1]:[W1204 13:30:50.326193432 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9414852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9415107Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9415270Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9415639Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9415843Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9415949Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9416045Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9416141Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9416143Z 2025-12-04T13:44:25.9416376Z [rank1]:[W1204 13:30:50.328568240 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9416547Z [rank3]:[W1204 13:30:50.696248753 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9416722Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9416997Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9417159Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9417573Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9417775Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9417894Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9417990Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9418088Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9418090Z 2025-12-04T13:44:25.9418325Z [rank3]:[W1204 13:30:50.697470046 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9418495Z [rank2]:[W1204 13:30:50.742795538 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9418670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9418927Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9419090Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9419456Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9419659Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9419766Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9419860Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9419958Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9419960Z 2025-12-04T13:44:25.9420197Z [rank2]:[W1204 13:30:50.744052970 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9420405Z [rank0]:W1204 13:30:51.175000 73639 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.9420575Z [rank1]:[W1204 13:30:51.328712635 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9420750Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9421040Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9421203Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9421584Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9421787Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9421905Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9422001Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9422097Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9422099Z 2025-12-04T13:44:25.9422338Z [rank1]:[W1204 13:30:51.329952228 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9422507Z [rank3]:[W1204 13:30:51.697631921 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9422681Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9422939Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9423101Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9423471Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9423673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9423779Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9423876Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9423973Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9423975Z 2025-12-04T13:44:25.9424208Z [rank3]:[W1204 13:30:51.699726695 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9424380Z [rank2]:[W1204 13:30:51.744364762 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9424556Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9424825Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9424997Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9425363Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9425581Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9425696Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9425792Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9425890Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9425891Z 2025-12-04T13:44:25.9426123Z [rank2]:[W1204 13:30:51.746542714 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9426294Z [rank1]:[W1204 13:30:52.330088294 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.9426468Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9426727Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9426890Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9427255Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9427457Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9427586Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9427682Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9427778Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9427780Z 2025-12-04T13:44:25.9428013Z [rank1]:[W1204 13:30:52.331338446 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9428182Z [rank3]:[W1204 13:30:52.699876340 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9428356Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9428614Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9428795Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9429175Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9429390Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9429495Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9429604Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9429701Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9429703Z 2025-12-04T13:44:25.9429938Z [rank3]:[W1204 13:30:52.701973824 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9430109Z [rank2]:[W1204 13:30:52.746699769 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9430284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9430545Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9430709Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9431075Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9431277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9431381Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9431477Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9431575Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9431577Z 2025-12-04T13:44:25.9431812Z [rank2]:[W1204 13:30:52.748605057 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9431982Z [rank1]:[W1204 13:30:53.331464762 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9432156Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9432410Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9432573Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9432964Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9433174Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9433278Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9433375Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9433481Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9433483Z 2025-12-04T13:44:25.9433718Z [rank1]:[W1204 13:30:53.333230883 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9433887Z [rank3]:[W1204 13:30:53.702137960 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9434062Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9434318Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9434479Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9434851Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9435053Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9435159Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9435253Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9435349Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9435352Z 2025-12-04T13:44:25.9435586Z [rank3]:[W1204 13:30:53.704260733 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9435756Z [rank2]:[W1204 13:30:53.748736394 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9435931Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9436186Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9436349Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9436737Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9436940Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9437055Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9437150Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9437247Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9437249Z 2025-12-04T13:44:25.9437523Z [rank2]:[W1204 13:30:53.750084844 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9437694Z [rank1]:[W1204 13:30:54.333373800 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9437866Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9438126Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9438287Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9438655Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9438857Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9438962Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9439057Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9439154Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9439155Z 2025-12-04T13:44:25.9439391Z [rank1]:[W1204 13:30:54.335713608 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9439561Z [rank3]:[W1204 13:30:54.704392100 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9439735Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9439991Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9440153Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9440520Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9440758Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9440863Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9440970Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9441065Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9441067Z 2025-12-04T13:44:25.9441301Z [rank3]:[W1204 13:30:54.705944086 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9441487Z [rank2]:[W1204 13:30:54.750233640 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9441663Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9441918Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9442082Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9442447Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9442653Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9442757Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9442853Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9442949Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9442951Z 2025-12-04T13:44:25.9443182Z [rank2]:[W1204 13:30:54.752009451 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9443353Z [rank1]:[W1204 13:30:55.335837745 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9443529Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9443787Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9443950Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9444315Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9444530Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9444643Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9444739Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9444846Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9444847Z 2025-12-04T13:44:25.9445080Z [rank1]:[W1204 13:30:55.337827391 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9445250Z [rank3]:[W1204 13:30:55.706090872 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9445437Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9445696Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9445858Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9446227Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9446428Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9446534Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9446629Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9446725Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9446728Z 2025-12-04T13:44:25.9446962Z [rank3]:[W1204 13:30:55.707973061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9447131Z [rank2]:[W1204 13:30:55.752155698 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9447306Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9447597Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9447761Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9448135Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9448338Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9448456Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9448564Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9448662Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9448680Z 2025-12-04T13:44:25.9448912Z [rank2]:[W1204 13:30:55.753385851 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9449083Z [rank1]:[W1204 13:30:56.337984638 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9449272Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9449530Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9449691Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9450059Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9450263Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9450368Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9450464Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9450559Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9450561Z 2025-12-04T13:44:25.9450794Z [rank1]:[W1204 13:30:56.339264720 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9450963Z [rank3]:[W1204 13:30:56.708115828 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9451136Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9451395Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9451556Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9451922Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9452123Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9452230Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9452337Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9452446Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9452449Z 2025-12-04T13:44:25.9452683Z [rank3]:[W1204 13:30:56.709991066 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9452863Z [rank2]:[W1204 13:30:56.753539777 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9453036Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9453307Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9453471Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9453835Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9454039Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9454145Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9454240Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9454340Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9454341Z 2025-12-04T13:44:25.9454579Z [rank2]:[W1204 13:30:56.755386227 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9454752Z [rank1]:[W1204 13:30:57.339410557 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9454926Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9455183Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9455348Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9455715Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9455917Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9456020Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9456116Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9456232Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9456234Z 2025-12-04T13:44:25.9456482Z [rank1]:[W1204 13:30:57.340665139 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9456663Z [rank3]:[W1204 13:30:57.710171313 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9456840Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9457100Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9457273Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9457682Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9457883Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9457988Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9458084Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9458181Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9458182Z 2025-12-04T13:44:25.9458417Z [rank3]:[W1204 13:30:57.711986553 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9458585Z [rank2]:[W1204 13:30:57.755538214 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9458761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9459017Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9459182Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9459549Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9459751Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9459856Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9459950Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9460048Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9460050Z 2025-12-04T13:44:25.9460313Z [rank2]:[W1204 13:30:57.757508780 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9460483Z [rank1]:[W1204 13:30:58.340809277 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9460671Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9460925Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9461102Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9461472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9461673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9461776Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9461872Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9461966Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9461969Z 2025-12-04T13:44:25.9462204Z [rank1]:[W1204 13:30:58.342063629 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9462373Z [rank3]:[W1204 13:30:58.712179479 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9462549Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9462804Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9462965Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9463335Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9463537Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9463642Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9463739Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9463835Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9463838Z 2025-12-04T13:44:25.9464080Z [rank3]:[W1204 13:30:58.714255484 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9464258Z [rank2]:[W1204 13:30:58.757687417 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9464433Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9464702Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9464866Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9465247Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9465449Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9465556Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9465650Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9465747Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9465749Z 2025-12-04T13:44:25.9465983Z [rank2]:[W1204 13:30:58.759849520 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9466153Z [rank1]:[W1204 13:30:59.342192127 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9466328Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9466584Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9466745Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9467111Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9467314Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9467418Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9467551Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9467647Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9467649Z 2025-12-04T13:44:25.9467887Z [rank1]:[W1204 13:30:59.343726863 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9468070Z [rank3]:[W1204 13:30:59.714434001 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9468258Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9468514Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9468691Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9469058Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9469281Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9469385Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9469481Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9469578Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9469580Z 2025-12-04T13:44:25.9469819Z [rank3]:[W1204 13:30:59.716797529 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9469990Z [rank2]:[W1204 13:30:59.759990798 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9470166Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9470418Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9470582Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9470950Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9471154Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9471259Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9471355Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9471451Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9471453Z 2025-12-04T13:44:25.9471685Z [rank2]:[W1204 13:30:59.761773808 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9471855Z [rank1]:[W1204 13:31:00.343875881 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9472054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9472310Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9472485Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9472849Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9473062Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9473166Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9473262Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9473359Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9473361Z 2025-12-04T13:44:25.9473595Z [rank1]:[W1204 13:31:00.345116424 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9473764Z [rank3]:[W1204 13:31:00.716947687 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9473940Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9474199Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9474362Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9474728Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9474929Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9475035Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9475130Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9475226Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9475229Z 2025-12-04T13:44:25.9475463Z [rank3]:[W1204 13:31:00.719226656 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9475631Z [rank2]:[W1204 13:31:00.761904517 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9475806Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9476080Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9476244Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9476625Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9476839Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9476944Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9477040Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9477136Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9477139Z 2025-12-04T13:44:25.9477370Z [rank2]:[W1204 13:31:00.764258225 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9477573Z [rank1]:[W1204 13:31:01.345256372 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9477749Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9478005Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9478167Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9478535Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9478739Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9478843Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9478940Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9479036Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9479037Z 2025-12-04T13:44:25.9479270Z [rank1]:[W1204 13:31:01.346510645 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9479440Z [rank3]:[W1204 13:31:01.719402824 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9479613Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9479897Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9480059Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9480433Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9480648Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9480766Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9480863Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9480958Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9480960Z 2025-12-04T13:44:25.9481195Z [rank3]:[W1204 13:31:01.721720133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9481365Z [rank2]:[W1204 13:31:01.764406323 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9481541Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9481797Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9481961Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9482330Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9482531Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9482638Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9482734Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9482832Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9482834Z 2025-12-04T13:44:25.9483066Z [rank2]:[W1204 13:31:01.765929410 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9483275Z [rank3]:W1204 13:31:02.343000 73642 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.9483444Z [rank1]:[W1204 13:31:02.346645723 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9483620Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9483895Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9484057Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9484439Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9484642Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9484761Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9484859Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9484957Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9484959Z 2025-12-04T13:44:25.9485193Z [rank1]:[W1204 13:31:02.347887896 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9485364Z [rank3]:[W1204 13:31:02.721944180 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9485538Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9485794Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9485957Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9486326Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9486529Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9486637Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9486732Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9486830Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9486832Z 2025-12-04T13:44:25.9487067Z [rank3]:[W1204 13:31:02.724142092 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9487238Z [rank2]:[W1204 13:31:02.766103608 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9487411Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9487707Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9487895Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9488265Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9488479Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9488598Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9488696Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9488793Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9488795Z 2025-12-04T13:44:25.9489029Z [rank2]:[W1204 13:31:02.768351498 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9489199Z [rank1]:[W1204 13:31:03.348027835 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:25.9489374Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9489630Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9489794Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9490162Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9490363Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9490468Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9490563Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9490660Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9490662Z 2025-12-04T13:44:25.9490896Z [rank1]:[W1204 13:31:03.349361236 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9491066Z [rank3]:[W1204 13:31:03.724266601 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9491241Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9491498Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9491672Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9492048Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9492260Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9492365Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9492470Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9492566Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9492570Z 2025-12-04T13:44:25.9492803Z [rank3]:[W1204 13:31:03.726643729 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9492973Z [rank2]:[W1204 13:31:03.768513217 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9493147Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9493403Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9493568Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9493937Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9494139Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9494243Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9494339Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9494437Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9494439Z 2025-12-04T13:44:25.9494674Z [rank2]:[W1204 13:31:03.770830706 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9494842Z [rank1]:[W1204 13:31:04.349476016 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9495017Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9495272Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9495434Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9495825Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9496036Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9496141Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9496235Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9496332Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9500207Z 2025-12-04T13:44:25.9500444Z [rank1]:[W1204 13:31:04.350783147 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9500614Z [rank3]:[W1204 13:31:04.726833747 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9500790Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9501046Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9501209Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9501578Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9501781Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9501887Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9501984Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9502081Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9502083Z 2025-12-04T13:44:25.9502315Z [rank3]:[W1204 13:31:04.728729335 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9502486Z [rank2]:[W1204 13:31:04.770981085 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9502660Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9502918Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9503081Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9503483Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9503686Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9503802Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9503900Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9503996Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9503998Z 2025-12-04T13:44:25.9504231Z [rank2]:[W1204 13:31:04.773581668 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9504418Z [rank1]:[W1204 13:31:05.350924266 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9504593Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9504850Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9505010Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9505377Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9505579Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9505686Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9505781Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9505877Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9505878Z 2025-12-04T13:44:25.9506115Z [rank1]:[W1204 13:31:05.352168739 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9506287Z [rank3]:[W1204 13:31:05.728882245 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9506462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9506718Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9506883Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9507249Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9507486Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9507593Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9507702Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9507797Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9507799Z 2025-12-04T13:44:25.9508032Z [rank3]:[W1204 13:31:05.731062197 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9508219Z [rank2]:[W1204 13:31:05.773759127 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9508396Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9508653Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9508818Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9509185Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9509391Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9509494Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9509591Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9509689Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9509691Z 2025-12-04T13:44:25.9509925Z [rank2]:[W1204 13:31:05.776072466 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9510095Z [rank1]:[W1204 13:31:06.352302049 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9510269Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9510530Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9510694Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9511065Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9511281Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9511399Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9511495Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9511600Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9511603Z 2025-12-04T13:44:25.9511836Z [rank1]:[W1204 13:31:06.353534232 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9512005Z [rank3]:[W1204 13:31:06.731190887 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9512192Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9512450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9512614Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9512982Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9513185Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9513292Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9513387Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9513483Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9513486Z 2025-12-04T13:44:25.9513719Z [rank3]:[W1204 13:31:06.733593544 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9513890Z [rank2]:[W1204 13:31:06.776225445 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9514064Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9514321Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9514485Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9514857Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9515060Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9515175Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9515281Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9515377Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9515389Z 2025-12-04T13:44:25.9515623Z [rank2]:[W1204 13:31:06.778511815 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9515830Z [rank2]:W1204 13:31:07.212000 73641 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:25.9516011Z [rank1]:[W1204 13:31:07.353666082 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9516187Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9516442Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9516609Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9516982Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9517186Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9517292Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9517385Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9517513Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9517514Z 2025-12-04T13:44:25.9517747Z [rank1]:[W1204 13:31:07.354955254 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9517917Z [rank3]:[W1204 13:31:07.733712225 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9518092Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9518349Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9518511Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9518887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9519089Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9519223Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9519319Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9519415Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9519429Z 2025-12-04T13:44:25.9519663Z [rank3]:[W1204 13:31:07.736311737 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9519832Z [rank2]:[W1204 13:31:07.778668385 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9520018Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9520275Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9520437Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9520806Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9521010Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9521116Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9521211Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9521308Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9521310Z 2025-12-04T13:44:25.9521542Z [rank2]:[W1204 13:31:07.780892576 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9521711Z [rank1]:[W1204 13:31:08.355100874 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9521886Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9522142Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9522303Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9522669Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9522870Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9522976Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9523091Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9523190Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9523192Z 2025-12-04T13:44:25.9523424Z [rank1]:[W1204 13:31:08.356345187 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9523604Z [rank3]:[W1204 13:31:08.736478678 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9523777Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9524045Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9524207Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9524573Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9524776Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9524880Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9524976Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9525072Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9525074Z 2025-12-04T13:44:25.9525309Z [rank3]:[W1204 13:31:08.738551212 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9525479Z [rank2]:[W1204 13:31:08.781078526 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9525653Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9525908Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9526071Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9526437Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9526639Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9526743Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9526838Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9526946Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9526958Z 2025-12-04T13:44:25.9527192Z [rank2]:[W1204 13:31:08.782419786 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9527372Z [rank1]:[W1204 13:31:09.356517377 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9527586Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9527840Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9528016Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9528381Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9528583Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9528688Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9528784Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9528880Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9528883Z 2025-12-04T13:44:25.9529116Z [rank1]:[W1204 13:31:09.358366886 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9529288Z [rank3]:[W1204 13:31:09.738703422 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9529462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9529720Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9529887Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9530253Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9530454Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9530558Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9530655Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9530752Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9530754Z 2025-12-04T13:44:25.9531014Z [rank3]:[W1204 13:31:09.740978532 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9531185Z [rank2]:[W1204 13:31:09.782619136 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9531371Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9531627Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9531806Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9532176Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9532377Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9532482Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9532577Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9532675Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9532677Z 2025-12-04T13:44:25.9532910Z [rank2]:[W1204 13:31:09.785010413 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9533079Z [rank1]:[W1204 13:31:10.358515287 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9533255Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9533512Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9533676Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9534046Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9534250Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9534356Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9534451Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9534547Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9534550Z 2025-12-04T13:44:25.9534793Z [rank1]:[W1204 13:31:10.360251008 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9534974Z [rank3]:[W1204 13:31:10.741124003 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9535149Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9535417Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9535579Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9535957Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9536159Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9536265Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9536362Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9536458Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9536461Z 2025-12-04T13:44:25.9536694Z [rank3]:[W1204 13:31:10.743327725 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9536866Z [rank2]:[W1204 13:31:10.785136954 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9537040Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9537298Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9537459Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9537864Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9538066Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9538172Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9538267Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9538363Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9538365Z 2025-12-04T13:44:25.9538600Z [rank2]:[W1204 13:31:10.787215748 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9538795Z [rank1]:[W1204 13:31:11.360418119 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9538970Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9539238Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9539400Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9539768Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9539982Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9540087Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9540182Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9540278Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9540280Z 2025-12-04T13:44:25.9540513Z [rank1]:[W1204 13:31:11.362220799 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9540687Z [rank3]:[W1204 13:31:11.743579414 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9540862Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9541118Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9541280Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9541645Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9541848Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9541951Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9542047Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9542141Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9542143Z 2025-12-04T13:44:25.9542378Z [rank3]:[W1204 13:31:11.745925702 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9542550Z [rank2]:[W1204 13:31:11.787368750 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9542743Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9543003Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9543179Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9543551Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9543763Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9543870Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9543970Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9544069Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9544071Z 2025-12-04T13:44:25.9544304Z [rank2]:[W1204 13:31:11.789121671 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9544474Z [rank1]:[W1204 13:31:12.362338351 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9544649Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9544905Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9545068Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9545438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9545639Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9545744Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9545839Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9545939Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9545941Z 2025-12-04T13:44:25.9546173Z [rank1]:[W1204 13:31:12.363672292 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9546346Z [rank3]:[W1204 13:31:12.746070104 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9546521Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9546800Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9546979Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9547355Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9547592Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9547697Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9547795Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9547891Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9547893Z 2025-12-04T13:44:25.9548128Z [rank3]:[W1204 13:31:12.748452761 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9548298Z [rank2]:[W1204 13:31:12.789313402 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9548473Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9548730Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9548891Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9549260Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9549463Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9549569Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9549666Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9549762Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9549765Z 2025-12-04T13:44:25.9549999Z [rank2]:[W1204 13:31:12.791628701 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9550168Z [rank1]:[W1204 13:31:13.363797074 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9550345Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9550628Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9550791Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9551172Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9551372Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9551489Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9551585Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9551684Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9551685Z 2025-12-04T13:44:25.9551918Z [rank1]:[W1204 13:31:13.365133915 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9552090Z [rank3]:[W1204 13:31:13.748572973 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9552264Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9552522Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9552685Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9553050Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9553252Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9553358Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9553454Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9553550Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9553552Z 2025-12-04T13:44:25.9553789Z [rank3]:[W1204 13:31:13.750802974 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9553962Z [rank2]:[W1204 13:31:13.791784662 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9554137Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9554393Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9554576Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9554943Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9555152Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9555258Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9555367Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9555463Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9555466Z 2025-12-04T13:44:25.9555699Z [rank2]:[W1204 13:31:13.794035513 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9555870Z [rank1]:[W1204 13:31:14.365278337 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9556046Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9556302Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9556468Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9556836Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9557037Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9557142Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9557237Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9557333Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9557335Z 2025-12-04T13:44:25.9557609Z [rank1]:[W1204 13:31:14.367251564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9557780Z [rank3]:[W1204 13:31:14.750928657 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9557954Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9558213Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9558377Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9558771Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9558987Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9559090Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9559187Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9559296Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9559298Z 2025-12-04T13:44:25.9559532Z [rank3]:[W1204 13:31:14.753235626 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9559703Z [rank2]:[W1204 13:31:14.794186445 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9559878Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9560133Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9560297Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9560667Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9560870Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9560974Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9561070Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9561168Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9561170Z 2025-12-04T13:44:25.9561407Z [rank2]:[W1204 13:31:14.795731131 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9561574Z [rank1]:[W1204 13:31:15.367418845 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9561750Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9562003Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9562168Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9562566Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9562768Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9562883Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9562977Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9563074Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9563087Z 2025-12-04T13:44:25.9563323Z [rank1]:[W1204 13:31:15.369721885 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9563494Z [rank3]:[W1204 13:31:15.753406178 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9563668Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9563924Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9564087Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9564458Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9564661Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9564766Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9564862Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9564957Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9564961Z 2025-12-04T13:44:25.9565192Z [rank3]:[W1204 13:31:15.754675720 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9565365Z [rank2]:[W1204 13:31:15.795911412 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9565539Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9565796Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9565958Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9566349Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9566551Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9566665Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9566760Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9566857Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9566860Z 2025-12-04T13:44:25.9567094Z [rank2]:[W1204 13:31:15.797827560 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9567276Z [rank1]:[W1204 13:31:16.369915656 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9567451Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9567748Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9567911Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9568277Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9568480Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9568583Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9568678Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9568775Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9568777Z 2025-12-04T13:44:25.9569012Z [rank1]:[W1204 13:31:16.372320193 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9569187Z [rank3]:[W1204 13:31:16.754883171 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9569362Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9569618Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9569783Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9570147Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9570377Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9570482Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9570590Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9570684Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9570688Z 2025-12-04T13:44:25.9570919Z [rank3]:[W1204 13:31:16.756791489 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9571103Z [rank2]:[W1204 13:31:16.797985353 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9571280Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9571538Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9571703Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9572073Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9572278Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9572382Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9572478Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9572576Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9572578Z 2025-12-04T13:44:25.9572811Z [rank2]:[W1204 13:31:16.799873261 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9572981Z [rank1]:[W1204 13:31:17.372452996 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9573156Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9573411Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9573576Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9573942Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9574152Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9574267Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9574361Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9574467Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9574469Z 2025-12-04T13:44:25.9574702Z [rank1]:[W1204 13:31:17.374759616 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9574872Z [rank3]:[W1204 13:31:17.756991411 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9575057Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9575315Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9575479Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9575854Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9576058Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9576162Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9576258Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9576352Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9576356Z 2025-12-04T13:44:25.9576590Z [rank3]:[W1204 13:31:17.758499478 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9576761Z [rank2]:[W1204 13:31:17.800039793 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9576935Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9577194Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9577356Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9577762Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9577966Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9578071Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9578199Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9578295Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9578297Z 2025-12-04T13:44:25.9578546Z [rank2]:[W1204 13:31:17.801733836 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9578716Z [rank1]:[W1204 13:31:18.374909429 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9578891Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9579160Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9579323Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9579691Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9579891Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9579998Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9580093Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9580192Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9580194Z 2025-12-04T13:44:25.9580426Z [rank1]:[W1204 13:31:18.376197850 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9580598Z [rank3]:[W1204 13:31:18.758702080 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9580774Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9581030Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9581193Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9581557Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9581761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9581865Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9581960Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9582077Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9582080Z 2025-12-04T13:44:25.9582317Z [rank3]:[W1204 13:31:18.760509190 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9582499Z [rank2]:[W1204 13:31:18.801897959 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9582673Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9582938Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9583103Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9583472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9583676Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9583780Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9583877Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9583973Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9583975Z 2025-12-04T13:44:25.9584210Z [rank2]:[W1204 13:31:18.804162749 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9584384Z [rank1]:[W1204 13:31:19.376350404 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9584560Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9584815Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9584981Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9585348Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9585551Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9585655Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9585751Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9585847Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9585848Z 2025-12-04T13:44:25.9586099Z [rank1]:[W1204 13:31:19.377709774 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9586272Z [rank3]:[W1204 13:31:19.760711083 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9586457Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9586713Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9586887Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9587255Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9587458Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9587596Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9587692Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9587791Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9587793Z 2025-12-04T13:44:25.9588027Z [rank3]:[W1204 13:31:19.762537642 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9588198Z [rank2]:[W1204 13:31:19.804321733 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9588371Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9588626Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9588789Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9589163Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9589366Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9589470Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9589566Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9589662Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9589664Z 2025-12-04T13:44:25.9589923Z [rank2]:[W1204 13:31:19.806711530 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9590093Z [rank1]:[W1204 13:31:20.377891997 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9590283Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9590537Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9590700Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9591096Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9591297Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9591403Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9591497Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9591593Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9591596Z 2025-12-04T13:44:25.9591828Z [rank1]:[W1204 13:31:20.380358602 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9592000Z [rank3]:[W1204 13:31:20.762749975 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9592174Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9592430Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9592594Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9592961Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9593164Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9593270Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9593369Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9593466Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9593468Z 2025-12-04T13:44:25.9593703Z [rank3]:[W1204 13:31:20.764954376 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9593894Z [rank2]:[W1204 13:31:20.806871294 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9594068Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9594334Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9594496Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9594873Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9595075Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9595179Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9595275Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9595371Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9595372Z 2025-12-04T13:44:25.9595607Z [rank2]:[W1204 13:31:20.809190902 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9595778Z [rank1]:[W1204 13:31:21.380520626 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9595954Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9596209Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9596373Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9596740Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9596942Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9597047Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9597143Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9597239Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9597241Z 2025-12-04T13:44:25.9597506Z [rank1]:[W1204 13:31:21.382890794 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9597693Z [rank3]:[W1204 13:31:21.765066071 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9597880Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9598136Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9598313Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9598683Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9598900Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9599005Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9599100Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9599196Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9599198Z 2025-12-04T13:44:25.9599430Z [rank3]:[W1204 13:31:21.766574528 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9599601Z [rank2]:[W1204 13:31:21.809302757 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9599778Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9600037Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9600199Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9600566Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9600770Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9600873Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9600970Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9601067Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9601069Z 2025-12-04T13:44:25.9601303Z [rank2]:[W1204 13:31:21.811621326 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9601472Z [rank1]:[W1204 13:31:22.383076857 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9601657Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9601925Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9602099Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9602468Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9602680Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9602786Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9602881Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9602981Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9602983Z 2025-12-04T13:44:25.9603216Z [rank1]:[W1204 13:31:22.385503754 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9603385Z [rank3]:[W1204 13:31:22.766758871 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9603560Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9603816Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9603979Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9604349Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9604552Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9604657Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9604753Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9604849Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9604852Z 2025-12-04T13:44:25.9605086Z [rank3]:[W1204 13:31:22.768910484 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9605256Z [rank2]:[W1204 13:31:22.811786620 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9605432Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9605709Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9605872Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9606254Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9606460Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9606573Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9606670Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9606766Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9606768Z 2025-12-04T13:44:25.9607002Z [rank2]:[W1204 13:31:22.814076630 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9607171Z [rank1]:[W1204 13:31:23.385670358 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9607346Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9607643Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9611850Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9612220Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9612420Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9612531Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9612628Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9612726Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9612728Z 2025-12-04T13:44:25.9612967Z [rank1]:[W1204 13:31:23.387935278 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9613139Z [rank3]:[W1204 13:31:23.769074668 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9613314Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9613597Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9613774Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9614139Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9614354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9614472Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9614569Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9614668Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9614670Z 2025-12-04T13:44:25.9614906Z [rank3]:[W1204 13:31:23.770781591 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9615078Z [rank2]:[W1204 13:31:23.814378381 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9615253Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9615509Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9615674Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9616040Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9616244Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9616348Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9616444Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9616541Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9616543Z 2025-12-04T13:44:25.9616777Z [rank2]:[W1204 13:31:23.816590842 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9616949Z [rank1]:[W1204 13:31:24.388094313 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9617126Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9617382Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9617599Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9617978Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9618191Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9618296Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9618405Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9618504Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9618506Z 2025-12-04T13:44:25.9618740Z [rank1]:[W1204 13:31:24.390387882 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9618909Z [rank3]:[W1204 13:31:24.770933106 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9619085Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9619340Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9619504Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9619871Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9620072Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9620177Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9620271Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9620369Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9620371Z 2025-12-04T13:44:25.9620604Z [rank3]:[W1204 13:31:24.772481851 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9620775Z [rank2]:[W1204 13:31:24.816739167 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9620949Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9621205Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9621370Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9621762Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9621973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9622077Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9622173Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9622279Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9622281Z 2025-12-04T13:44:25.9622516Z [rank2]:[W1204 13:31:24.817949381 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9622686Z [rank1]:[W1204 13:31:25.390564196 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9622862Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9623118Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9623280Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9623652Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9623857Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9623962Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9624057Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9624153Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9624156Z 2025-12-04T13:44:25.9624391Z [rank1]:[W1204 13:31:25.392583152 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9624561Z [rank3]:[W1204 13:31:25.772591737 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9624736Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9624992Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9625156Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9625546Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9625748Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9625865Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9625960Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9626056Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9626057Z 2025-12-04T13:44:25.9626303Z [rank3]:[W1204 13:31:25.774535455 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9626476Z [rank2]:[W1204 13:31:25.818097246 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9626650Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9626906Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9627068Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9627438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9627669Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9627772Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9627869Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9627964Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9627966Z 2025-12-04T13:44:25.9628200Z [rank2]:[W1204 13:31:25.820041223 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9628373Z [rank1]:[W1204 13:31:26.392753977 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9628547Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9628800Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9628962Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9629329Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9629555Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9629661Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9629769Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9629864Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9629866Z 2025-12-04T13:44:25.9630102Z [rank1]:[W1204 13:31:26.395148074 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9630286Z [rank3]:[W1204 13:31:26.774692730 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9630463Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9630718Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9630882Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9631250Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9631454Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9631559Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9631654Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9631750Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9631755Z 2025-12-04T13:44:25.9631988Z [rank3]:[W1204 13:31:26.776766554 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9632159Z [rank2]:[W1204 13:31:26.820173179 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9632336Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9632593Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9632757Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9633123Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9633336Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9633448Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9633545Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9633651Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9633653Z 2025-12-04T13:44:25.9633887Z [rank2]:[W1204 13:31:26.822581086 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9634056Z [rank1]:[W1204 13:31:27.395513505 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9634243Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9634499Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9634665Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9635036Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9635237Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9635344Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9635438Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9635534Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9635537Z 2025-12-04T13:44:25.9635769Z [rank1]:[W1204 13:31:27.396799157 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9635938Z [rank3]:[W1204 13:31:27.776925560 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9636114Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9636370Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9636533Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9636902Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9637104Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9637218Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9637322Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9637422Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9637433Z 2025-12-04T13:44:25.9637696Z [rank3]:[W1204 13:31:27.778239391 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9637866Z [rank2]:[W1204 13:31:27.822711322 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9638054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9638311Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9638472Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9638840Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9639044Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9639148Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9639246Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9639342Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9639343Z 2025-12-04T13:44:25.9639577Z [rank2]:[W1204 13:31:27.825072440 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9639747Z [rank1]:[W1204 13:31:28.396985872 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9639922Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9640180Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9640342Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9640707Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9640907Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9641012Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9641127Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9641236Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9641238Z 2025-12-04T13:44:25.9641472Z [rank1]:[W1204 13:31:28.399288241 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9641654Z [rank3]:[W1204 13:31:28.778359447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9641828Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9642097Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9642260Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9642625Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9642828Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9642933Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9643029Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9643126Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9643128Z 2025-12-04T13:44:25.9643360Z [rank3]:[W1204 13:31:28.779915423 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9643534Z [rank2]:[W1204 13:31:28.825212856 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9643710Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9643963Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9644129Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9644496Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9644699Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9644802Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9644899Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9645004Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9645006Z 2025-12-04T13:44:25.9645249Z [rank2]:[W1204 13:31:28.827355989 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9645430Z [rank1]:[W1204 13:31:29.399466336 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9645607Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9645864Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9646037Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9646404Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9646604Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9646708Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9646803Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9646899Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9646902Z 2025-12-04T13:44:25.9647136Z [rank1]:[W1204 13:31:29.401704317 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9647306Z [rank3]:[W1204 13:31:29.780092719 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9647515Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9647773Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9647937Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9648304Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9648506Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9648610Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9648704Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9648801Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9648803Z 2025-12-04T13:44:25.9649061Z [rank3]:[W1204 13:31:29.781483678 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9649232Z [rank2]:[W1204 13:31:29.827479196 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9649420Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9649675Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9649853Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9650224Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9650428Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9650531Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9650627Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9650723Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9650726Z 2025-12-04T13:44:25.9650962Z [rank2]:[W1204 13:31:29.829766635 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9651131Z [rank1]:[W1204 13:31:30.401899083 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9651306Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9651560Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9651721Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9652096Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9652296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9652401Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9652496Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9652593Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9652596Z 2025-12-04T13:44:25.9652837Z [rank1]:[W1204 13:31:30.404235421 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9653016Z [rank3]:[W1204 13:31:30.781653314 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9653192Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9653456Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9653619Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9654003Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9654206Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9654312Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9654406Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9654502Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9654504Z 2025-12-04T13:44:25.9654737Z [rank3]:[W1204 13:31:30.783223280 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9654907Z [rank2]:[W1204 13:31:30.829867813 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9655080Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9655339Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9655502Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9655868Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9656070Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9656175Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9656270Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9656365Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9656367Z 2025-12-04T13:44:25.9656603Z [rank2]:[W1204 13:31:30.832296660 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9656789Z [rank1]:[W1204 13:31:31.404426267 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9656971Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9657227Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9657400Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9657801Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9658018Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9658123Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9658221Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9658316Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9658318Z 2025-12-04T13:44:25.9658553Z [rank1]:[W1204 13:31:31.406757196 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9658724Z [rank3]:[W1204 13:31:31.783418005 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9658900Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9659157Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9659321Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9659690Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9659893Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9659998Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9660092Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9660189Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9660191Z 2025-12-04T13:44:25.9660424Z [rank3]:[W1204 13:31:31.785217286 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9660596Z [rank2]:[W1204 13:31:31.832417607 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9660797Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9661054Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9661230Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9661595Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9661809Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9661915Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9662010Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9662106Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9662109Z 2025-12-04T13:44:25.9662341Z [rank2]:[W1204 13:31:31.834432112 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9662511Z [rank1]:[W1204 13:31:32.406947611 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9662685Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9662943Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9663107Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9663472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9663673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9663778Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9663873Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9663968Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9663970Z 2025-12-04T13:44:25.9664204Z [rank1]:[W1204 13:31:32.409358888 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9664373Z [rank3]:[W1204 13:31:32.785405442 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9664548Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9664826Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9664991Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9665371Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9665581Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9665686Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9665781Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9665877Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9665880Z 2025-12-04T13:44:25.9666110Z [rank3]:[W1204 13:31:32.787317210 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9666281Z [rank2]:[W1204 13:31:32.834602879 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9666456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9666714Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9666877Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9667245Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9667449Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9667596Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9667693Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9667788Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9667790Z 2025-12-04T13:44:25.9668022Z [rank2]:[W1204 13:31:32.836849509 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9668193Z [rank1]:[W1204 13:31:33.409549335 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9668366Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9668639Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9668813Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9669181Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9669402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9669521Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9669618Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9669715Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9669716Z 2025-12-04T13:44:25.9669949Z [rank1]:[W1204 13:31:33.411918902 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9670119Z [rank3]:[W1204 13:31:33.787498786 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9670292Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9670547Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9670711Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9671077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9671277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9671382Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9671476Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9671573Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9671575Z 2025-12-04T13:44:25.9671809Z [rank3]:[W1204 13:31:33.789318096 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9671980Z [rank2]:[W1204 13:31:33.837021166 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9672152Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9672407Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9672591Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9672957Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9673170Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9673273Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9673381Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9673478Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9673481Z 2025-12-04T13:44:25.9673716Z [rank2]:[W1204 13:31:33.839127920 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9673889Z [rank1]:[W1204 13:31:34.412084550 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9674064Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9674319Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9674483Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9674850Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9675050Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9675155Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9675252Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9675347Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9675349Z 2025-12-04T13:44:25.9675585Z [rank1]:[W1204 13:31:34.414314981 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9675755Z [rank3]:[W1204 13:31:34.789658310 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9675932Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9676190Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9676354Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9676741Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9676951Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9677056Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9677149Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9677256Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9677258Z 2025-12-04T13:44:25.9677529Z [rank3]:[W1204 13:31:34.792257002 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9677700Z [rank2]:[W1204 13:31:34.839274708 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9677873Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9678129Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9678294Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9678663Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9678866Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9678970Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9679065Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9679162Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9679166Z 2025-12-04T13:44:25.9679400Z [rank2]:[W1204 13:31:34.840512780 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9679570Z [rank1]:[W1204 13:31:35.414489798 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9679743Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9679996Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9680158Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9680566Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9680770Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9680886Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9680981Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9681076Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9681093Z 2025-12-04T13:44:25.9681326Z [rank1]:[W1204 13:31:35.416730548 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9681498Z [rank3]:[W1204 13:31:35.792411760 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9681674Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9681929Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9682091Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9682461Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9682662Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9682768Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9682863Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9682962Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9682964Z 2025-12-04T13:44:25.9683195Z [rank3]:[W1204 13:31:35.794292989 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9683368Z [rank2]:[W1204 13:31:35.840666908 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9683542Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9683797Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9683960Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9684336Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9684555Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9684669Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9684765Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9684862Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9684865Z 2025-12-04T13:44:25.9685098Z [rank2]:[W1204 13:31:35.841890821 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9685281Z [rank1]:[W1204 13:31:36.416988734 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9685456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9685711Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9685873Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9686240Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9686443Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9686546Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9686642Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9686737Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9686739Z 2025-12-04T13:44:25.9686974Z [rank1]:[W1204 13:31:36.419742124 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9687147Z [rank3]:[W1204 13:31:36.794418907 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9687325Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9687616Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9687779Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9688145Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9688372Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9688477Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9688572Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9688682Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9688683Z 2025-12-04T13:44:25.9688917Z [rank3]:[W1204 13:31:36.796682908 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9689101Z [rank2]:[W1204 13:31:36.842074559 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9689281Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9689537Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9689702Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9690067Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9690271Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9690376Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9690471Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9690568Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9690570Z 2025-12-04T13:44:25.9690804Z [rank2]:[W1204 13:31:36.843923768 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9690972Z [rank1]:[W1204 13:31:37.419865522 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9691148Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9691406Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9691569Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9691935Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9692138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9692264Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9692360Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9692454Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9692466Z 2025-12-04T13:44:25.9692700Z [rank1]:[W1204 13:31:37.422314738 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9692868Z [rank3]:[W1204 13:31:37.796861386 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9693055Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9693311Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9693473Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9693845Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9694048Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9694152Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9694247Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9694343Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9694346Z 2025-12-04T13:44:25.9694579Z [rank3]:[W1204 13:31:37.799092406 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9694749Z [rank2]:[W1204 13:31:37.844098066 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9694925Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9695179Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9695343Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9695712Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9695918Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9696022Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9696137Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9696234Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9696236Z 2025-12-04T13:44:25.9696469Z [rank2]:[W1204 13:31:37.845931406 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9696650Z [rank1]:[W1204 13:31:38.422436398 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9696823Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9697091Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9697253Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9697653Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9697856Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9697961Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9698058Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9698153Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9698155Z 2025-12-04T13:44:25.9698390Z [rank1]:[W1204 13:31:38.424528622 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9698560Z [rank3]:[W1204 13:31:38.799270754 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9698735Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9698990Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9699152Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9699518Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9699720Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9699825Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9699919Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9700034Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9700048Z 2025-12-04T13:44:25.9700287Z [rank3]:[W1204 13:31:38.801453006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9700471Z [rank2]:[W1204 13:31:38.846086364 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9700646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9700901Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9701082Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9701449Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9701652Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9701756Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9701853Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9701949Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9701952Z 2025-12-04T13:44:25.9702185Z [rank2]:[W1204 13:31:38.847287888 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9702360Z [rank1]:[W1204 13:31:39.424715730 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9702533Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9702788Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9702952Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9703318Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9703520Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9703623Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9703718Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9703814Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9703815Z 2025-12-04T13:44:25.9704065Z [rank1]:[W1204 13:31:39.427005199 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9704235Z [rank3]:[W1204 13:31:39.801574426 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9704421Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9704678Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9704850Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9705218Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9705420Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9705525Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9705621Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9705718Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9705719Z 2025-12-04T13:44:25.9705952Z [rank3]:[W1204 13:31:39.803641450 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9706121Z [rank2]:[W1204 13:31:39.847432527 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9706297Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9706552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9706716Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9707087Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9707289Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9707393Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9707524Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9707622Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9707626Z 2025-12-04T13:44:25.9707872Z [rank2]:[W1204 13:31:39.849395953 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9708054Z [rank1]:[W1204 13:31:40.427162058 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9708228Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9708502Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9708664Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9709051Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9709253Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9709358Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9709453Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9709548Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9709551Z 2025-12-04T13:44:25.9709785Z [rank1]:[W1204 13:31:40.429762851 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9709957Z [rank3]:[W1204 13:31:40.803792750 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9710131Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9710390Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9710553Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9710921Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9711123Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9711228Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9711322Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9711418Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9711420Z 2025-12-04T13:44:25.9711652Z [rank3]:[W1204 13:31:40.806109659 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9711844Z [rank2]:[W1204 13:31:40.849537213 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9712020Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9712284Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9712447Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9712813Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9713036Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9713140Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9713237Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9713336Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9713338Z 2025-12-04T13:44:25.9713572Z [rank2]:[W1204 13:31:40.851510319 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9713743Z [rank1]:[W1204 13:31:41.429922700 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9713919Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9714174Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9714337Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9714704Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9714909Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9715012Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9715109Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9715204Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9715206Z 2025-12-04T13:44:25.9715444Z [rank1]:[W1204 13:31:41.432335317 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9715614Z [rank3]:[W1204 13:31:41.806303277 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9715807Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9716065Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9716237Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9716603Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9716816Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9716921Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9717015Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9717112Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9717114Z 2025-12-04T13:44:25.9717347Z [rank3]:[W1204 13:31:41.808269584 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9717551Z [rank2]:[W1204 13:31:41.851657329 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9717728Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9717984Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9718147Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9718514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9718717Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9718822Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9718918Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9719016Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9719018Z 2025-12-04T13:44:25.9719249Z [rank2]:[W1204 13:31:41.852929621 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9719419Z [rank1]:[W1204 13:31:42.432506286 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9719594Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9719878Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9720053Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9720422Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9720635Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9720740Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9720836Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9720931Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9720934Z 2025-12-04T13:44:25.9721166Z [rank1]:[W1204 13:31:42.434136421 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9721334Z [rank3]:[W1204 13:31:42.808447523 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9721510Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9721766Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9721927Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9722296Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9722496Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9722601Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9722696Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9722793Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9722796Z 2025-12-04T13:44:25.9723028Z [rank3]:[W1204 13:31:42.810397940 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9723197Z [rank2]:[W1204 13:31:42.853040892 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9723372Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9723647Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9723811Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9724191Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9724396Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9724512Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9724608Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9724705Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9724707Z 2025-12-04T13:44:25.9724939Z [rank2]:[W1204 13:31:42.855035658 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9725111Z [rank1]:[W1204 13:31:43.434276991 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9725284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9725539Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9725702Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9726070Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9726273Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9726378Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9726474Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9726571Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9726573Z 2025-12-04T13:44:25.9726807Z [rank1]:[W1204 13:31:43.436115520 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9726976Z [rank3]:[W1204 13:31:43.810528041 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9727151Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9727406Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9727640Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9728007Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9728226Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9728330Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9728437Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9728534Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9728537Z 2025-12-04T13:44:25.9728773Z [rank3]:[W1204 13:31:43.812765792 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9728943Z [rank2]:[W1204 13:31:43.855170018 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9729117Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9729371Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9729536Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9729903Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9730106Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9730210Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9730305Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9730401Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9730404Z 2025-12-04T13:44:25.9730635Z [rank2]:[W1204 13:31:43.856390061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9730806Z [rank1]:[W1204 13:31:44.436264400 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9730979Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9731233Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9731396Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9731783Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9731994Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9732098Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9732197Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9732302Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9732304Z 2025-12-04T13:44:25.9732538Z [rank1]:[W1204 13:31:44.437545152 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9732707Z [rank3]:[W1204 13:31:44.812929011 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9732882Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9733140Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9733304Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9733672Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9733873Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9733978Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9734073Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9734171Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9734172Z 2025-12-04T13:44:25.9734406Z [rank3]:[W1204 13:31:44.815617282 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9734576Z [rank2]:[W1204 13:31:44.856558021 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9734751Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9735006Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9735173Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9735558Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9735761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9735877Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9735973Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9736069Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9736081Z 2025-12-04T13:44:25.9736315Z [rank2]:[W1204 13:31:44.857949821 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9736485Z [rank1]:[W1204 13:31:45.437713872 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9736660Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9736916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9737079Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9737450Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9737696Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9737802Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9737898Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9737993Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9737996Z 2025-12-04T13:44:25.9738230Z [rank1]:[W1204 13:31:45.438976905 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9738400Z [rank3]:[W1204 13:31:45.815816452 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9738575Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9738832Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9738993Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9739392Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9739596Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9739716Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9739811Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9739909Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9739911Z 2025-12-04T13:44:25.9740143Z [rank3]:[W1204 13:31:45.817732240 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9740327Z [rank2]:[W1204 13:31:45.858090201 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9740502Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9740759Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9740922Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9741289Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9741495Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9741600Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9741696Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9741793Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9741795Z 2025-12-04T13:44:25.9742028Z [rank2]:[W1204 13:31:45.859308185 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9742199Z [rank1]:[W1204 13:31:46.439138825 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9742373Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9742626Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9742789Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9743153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9743381Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9743486Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9743593Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9743691Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9743693Z 2025-12-04T13:44:25.9743930Z [rank1]:[W1204 13:31:46.441549672 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9744110Z [rank3]:[W1204 13:31:46.817901480 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9744286Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9744542Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9744703Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9745070Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9745274Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9745379Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9745474Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9745571Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9745574Z 2025-12-04T13:44:25.9745806Z [rank3]:[W1204 13:31:46.819704970 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9745978Z [rank2]:[W1204 13:31:46.859555983 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9746154Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9746410Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9746573Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9746940Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9747152Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9747267Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9747363Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9747469Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9747506Z 2025-12-04T13:44:25.9747738Z [rank2]:[W1204 13:31:46.861414472 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9747909Z [rank1]:[W1204 13:31:47.441710043 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9748095Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9748353Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9748516Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9748883Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9749087Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9749191Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9749288Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9749383Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9749386Z 2025-12-04T13:44:25.9749619Z [rank1]:[W1204 13:31:47.444133300 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9749788Z [rank3]:[W1204 13:31:47.819882211 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9749963Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9750222Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9750383Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9750757Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9750957Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9751063Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9751186Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9751283Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9751297Z 2025-12-04T13:44:25.9751530Z [rank3]:[W1204 13:31:47.822219979 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9751699Z [rank2]:[W1204 13:31:47.861549864 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9751874Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9752142Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9752306Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9752676Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9752879Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9752984Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9753080Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9753178Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9753180Z 2025-12-04T13:44:25.9753413Z [rank2]:[W1204 13:31:47.862772957 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9753584Z [rank1]:[W1204 13:31:48.444243572 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9753757Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9754014Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9754178Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9754542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9754745Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9754851Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9754949Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9755064Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9755066Z 2025-12-04T13:44:25.9755299Z [rank1]:[W1204 13:31:48.446469363 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9755478Z [rank3]:[W1204 13:31:48.822635485 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9755652Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9755917Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9756080Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9756448Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9756651Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9756755Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9756853Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9756949Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9756951Z 2025-12-04T13:44:25.9757186Z [rank3]:[W1204 13:31:48.824958764 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9757356Z [rank2]:[W1204 13:31:48.862913599 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9757619Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9757872Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9758037Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9758405Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9758608Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9758712Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9758808Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9758905Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9758906Z 2025-12-04T13:44:25.9759170Z [rank2]:[W1204 13:31:48.865083321 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9759341Z [rank1]:[W1204 13:31:49.446601984 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9759527Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9759785Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9759960Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9760325Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9760528Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9760632Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9760727Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9760823Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9760825Z 2025-12-04T13:44:25.9761060Z [rank1]:[W1204 13:31:49.448970162 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9761230Z [rank3]:[W1204 13:31:49.825114635 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9761405Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9761666Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9761828Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9762195Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9762397Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9762502Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9762597Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9762693Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9762695Z 2025-12-04T13:44:25.9762946Z [rank3]:[W1204 13:31:49.827428375 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9763116Z [rank2]:[W1204 13:31:49.865212323 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9763304Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9763559Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9763723Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9764105Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9764307Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9764414Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9764509Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9764606Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9764609Z 2025-12-04T13:44:25.9764843Z [rank2]:[W1204 13:31:49.866560313 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9765014Z [rank1]:[W1204 13:31:50.449098645 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9765189Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9765445Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9765607Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9765980Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9766183Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9766286Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9766381Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9766477Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9766479Z 2025-12-04T13:44:25.9766712Z [rank1]:[W1204 13:31:50.451409354 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9766903Z [rank3]:[W1204 13:31:50.827583226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9767078Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9767347Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9767540Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9767923Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9768125Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9768230Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9768326Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9768421Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9768422Z 2025-12-04T13:44:25.9768657Z [rank3]:[W1204 13:31:50.829596172 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9768828Z [rank2]:[W1204 13:31:50.866665506 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9769001Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9769253Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9769416Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9769782Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9769984Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9770090Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9770186Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9770286Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9770288Z 2025-12-04T13:44:25.9770520Z [rank2]:[W1204 13:31:50.867927658 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9770707Z [rank1]:[W1204 13:31:51.451570296 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9770894Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9771150Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9771328Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9771694Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9771913Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9772016Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9772113Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9772208Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9772210Z 2025-12-04T13:44:25.9772446Z [rank1]:[W1204 13:31:51.454009552 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9772618Z [rank3]:[W1204 13:31:51.829774974 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9772792Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9773047Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9773211Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9773577Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9773781Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9773883Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9773979Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9774074Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9774076Z 2025-12-04T13:44:25.9774316Z [rank3]:[W1204 13:31:51.831738820 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9774485Z [rank2]:[W1204 13:31:51.868126540 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9774675Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9774939Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9775111Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9775476Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9775686Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9775793Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9775887Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9775985Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9775987Z 2025-12-04T13:44:25.9776219Z [rank2]:[W1204 13:31:51.870512477 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9776392Z [rank1]:[W1204 13:31:52.454141585 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9776568Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9776824Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9776987Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9777354Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9777608Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9777713Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9777809Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9777904Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9777908Z 2025-12-04T13:44:25.9778140Z [rank1]:[W1204 13:31:52.455376708 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9778311Z [rank3]:[W1204 13:31:52.831929262 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9778486Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9778768Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9778931Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9779313Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9779515Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9779633Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9779730Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9779824Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9779826Z 2025-12-04T13:44:25.9780060Z [rank3]:[W1204 13:31:52.834062065 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9780229Z [rank2]:[W1204 13:31:52.870621240 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9780404Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9780661Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9780826Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9781200Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9781403Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9781509Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9781604Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9781702Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9781703Z 2025-12-04T13:44:25.9781938Z [rank2]:[W1204 13:31:52.872794413 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9782109Z [rank1]:[W1204 13:31:53.455524420 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9782284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9782550Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9782723Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9783095Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9783306Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9783420Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9783515Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9783611Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9783615Z 2025-12-04T13:44:25.9783847Z [rank1]:[W1204 13:31:53.457991716 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9784018Z [rank3]:[W1204 13:31:53.834255567 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9784191Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9784446Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9784610Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9784976Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9785179Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9785283Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9785380Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9785476Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9785477Z 2025-12-04T13:44:25.9785711Z [rank3]:[W1204 13:31:53.836425119 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9785881Z [rank2]:[W1204 13:31:53.872899106 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9786056Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9786310Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9786486Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9786862Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9787081Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9787185Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9787290Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9787387Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9787389Z 2025-12-04T13:44:25.9787661Z [rank2]:[W1204 13:31:53.874965711 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9787832Z [rank1]:[W1204 13:31:54.458134779 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9788007Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9788263Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9788428Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9788794Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9788996Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9789100Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9789196Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9789292Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9789295Z 2025-12-04T13:44:25.9789533Z [rank1]:[W1204 13:31:54.460627244 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9789703Z [rank3]:[W1204 13:31:54.836625061 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9789878Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9790134Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9790297Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9790691Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9790908Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9791012Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9791107Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9791216Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9791218Z 2025-12-04T13:44:25.9791455Z [rank3]:[W1204 13:31:54.838708725 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9791625Z [rank2]:[W1204 13:31:54.875130934 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9791805Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9792058Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9792221Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9792589Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9792791Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9792897Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9792991Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9793088Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9793090Z 2025-12-04T13:44:25.9793323Z [rank2]:[W1204 13:31:54.877019912 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9793496Z [rank1]:[W1204 13:31:55.460801537 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9793670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9793923Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9794085Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9794468Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9794669Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9794783Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9794878Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9794975Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9794977Z 2025-12-04T13:44:25.9795219Z [rank1]:[W1204 13:31:55.462808343 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9795391Z [rank3]:[W1204 13:31:55.838896497 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9795564Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9795823Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9795984Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9796353Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9796554Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9796658Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9796754Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9796848Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9796850Z 2025-12-04T13:44:25.9797083Z [rank3]:[W1204 13:31:55.841221516 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9797255Z [rank2]:[W1204 13:31:55.877166085 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9797429Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9797736Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9797902Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9798271Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9798498Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9798603Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9798710Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9798807Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9798810Z 2025-12-04T13:44:25.9799043Z [rank2]:[W1204 13:31:55.879247579 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9799226Z [rank1]:[W1204 13:31:56.463006985 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9799402Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9799658Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9799822Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9800194Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9800399Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9800502Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9800599Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9800695Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9800697Z 2025-12-04T13:44:25.9800929Z [rank1]:[W1204 13:31:56.464654989 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9801102Z [rank3]:[W1204 13:31:56.841414979 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9801277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9801534Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9801698Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9802064Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9802278Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9802399Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9802496Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9802601Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9802603Z 2025-12-04T13:44:25.9802837Z [rank3]:[W1204 13:31:56.842814938 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9803006Z [rank2]:[W1204 13:31:56.879369484 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9803193Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9803447Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9803611Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9803979Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9804180Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9804286Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9804382Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9804480Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9804483Z 2025-12-04T13:44:25.9804716Z [rank2]:[W1204 13:31:56.881444388 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9804887Z [rank1]:[W1204 13:31:57.464790943 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9805062Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9805319Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9805481Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9805847Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9806050Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9806162Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9806269Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9806367Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9806379Z 2025-12-04T13:44:25.9806618Z [rank1]:[W1204 13:31:57.467094422 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9806790Z [rank3]:[W1204 13:31:57.843030480 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9806974Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9807232Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9807392Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9807786Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9807987Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9808093Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9808191Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9808287Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9808289Z 2025-12-04T13:44:25.9808523Z [rank3]:[W1204 13:31:57.845205362 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9808692Z [rank2]:[W1204 13:31:57.881596872 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9808867Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9809127Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9809289Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9809657Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9809858Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9809965Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9810077Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9810188Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9810190Z 2025-12-04T13:44:25.9810423Z [rank2]:[W1204 13:31:57.883750914 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9810606Z [rank1]:[W1204 13:31:58.467222506 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9810780Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9811051Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9811216Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9811580Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9811783Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9811888Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9811983Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9812081Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9812083Z 2025-12-04T13:44:25.9812316Z [rank1]:[W1204 13:31:58.469148064 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9812487Z [rank3]:[W1204 13:31:58.845400055 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9812661Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9812916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9813080Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9813448Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9813652Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9813756Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9813854Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9813959Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9813961Z 2025-12-04T13:44:25.9814206Z [rank3]:[W1204 13:31:58.847506599 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9814387Z [rank2]:[W1204 13:31:58.883871559 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9814562Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9814819Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9814993Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9815365Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9815570Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9815675Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9815772Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9815869Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9815870Z 2025-12-04T13:44:25.9816105Z [rank2]:[W1204 13:31:58.886133399 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9816274Z [rank1]:[W1204 13:31:59.469308298 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9816450Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9816704Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9816869Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9817237Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9817440Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9817581Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9817676Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9817774Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9817776Z 2025-12-04T13:44:25.9818041Z [rank1]:[W1204 13:31:59.471543849 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9818211Z [rank3]:[W1204 13:31:59.847712152 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9818399Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9818656Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9818831Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9819201Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9819402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9819505Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9819601Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9819696Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9819699Z 2025-12-04T13:44:25.9819934Z [rank3]:[W1204 13:31:59.849542132 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9820103Z [rank2]:[W1204 13:31:59.886256744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9820279Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9820532Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9820699Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9821071Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9821272Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9821377Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9821472Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9821569Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9821571Z 2025-12-04T13:44:25.9821816Z [rank2]:[W1204 13:31:59.887977136 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9821995Z [rank1]:[W1204 13:32:00.471725233 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9822170Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9822432Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9822595Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9822977Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9823180Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9823285Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9823380Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9823476Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9823478Z 2025-12-04T13:44:25.9823712Z [rank1]:[W1204 13:32:00.473948464 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9823884Z [rank3]:[W1204 13:32:00.849743945 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9824057Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9824314Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9824475Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9824841Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9825044Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9825148Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9825245Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9825344Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9825345Z 2025-12-04T13:44:25.9825581Z [rank3]:[W1204 13:32:00.852098683 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9825763Z [rank2]:[W1204 13:32:00.888272128 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9825950Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9826205Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9826379Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9826744Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9826957Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9827062Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9827158Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9827255Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9827257Z 2025-12-04T13:44:25.9827531Z [rank2]:[W1204 13:32:00.890479329 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9827702Z [rank1]:[W1204 13:32:01.474146577 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9827879Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9828134Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9828300Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9828665Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9828869Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9828974Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9829070Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9829166Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9829168Z 2025-12-04T13:44:25.9829401Z [rank1]:[W1204 13:32:01.476599273 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9829575Z [rank3]:[W1204 13:32:01.852293647 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9829776Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9830033Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9830208Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9830576Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9830796Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9830900Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9830996Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9831092Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9831094Z 2025-12-04T13:44:25.9831327Z [rank3]:[W1204 13:32:01.854683735 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9831496Z [rank2]:[W1204 13:32:01.890607634 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9831674Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9831933Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9832095Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9832463Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9832664Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9832770Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9832865Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9832962Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9832965Z 2025-12-04T13:44:25.9833199Z [rank2]:[W1204 13:32:01.892907014 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9833367Z [rank1]:[W1204 13:32:02.476751278 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9833542Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9833815Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9833979Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9834355Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9834568Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9834674Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9834770Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9834866Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9834869Z 2025-12-04T13:44:25.9835101Z [rank1]:[W1204 13:32:02.479132716 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9835269Z [rank3]:[W1204 13:32:02.854855469 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9835443Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9835701Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9835863Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9836231Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9836432Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9836537Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9836633Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9836727Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9836729Z 2025-12-04T13:44:25.9836964Z [rank3]:[W1204 13:32:02.857157409 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9837134Z [rank2]:[W1204 13:32:02.893037379 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9837309Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9837619Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9837794Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9838163Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9838379Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9838498Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9838592Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9838690Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9838692Z 2025-12-04T13:44:25.9838926Z [rank2]:[W1204 13:32:02.895246251 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9839097Z [rank1]:[W1204 13:32:03.479326890 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9839271Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9839525Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9839689Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9840057Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9840262Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9840367Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9840462Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9840559Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9840561Z 2025-12-04T13:44:25.9840793Z [rank1]:[W1204 13:32:03.481623579 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9840965Z [rank3]:[W1204 13:32:03.857347463 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9841138Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9841395Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9841577Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9841945Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9842156Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9842259Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9842366Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9842461Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9842464Z 2025-12-04T13:44:25.9842699Z [rank3]:[W1204 13:32:03.859318730 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9842869Z [rank2]:[W1204 13:32:03.895376916 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9843042Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9843298Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9843461Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9843828Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9844028Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9844135Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9844230Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9844329Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9844332Z 2025-12-04T13:44:25.9844572Z [rank2]:[W1204 13:32:03.897538539 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9844742Z [rank1]:[W1204 13:32:04.481766735 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9844917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9845170Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9845334Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9845721Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9845942Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9846046Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9846140Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9846248Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9846250Z 2025-12-04T13:44:25.9846484Z [rank1]:[W1204 13:32:04.484133203 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9846656Z [rank3]:[W1204 13:32:04.859476985 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9846831Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9847086Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9847250Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9847666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9847868Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9847971Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9848067Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9848162Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9848165Z 2025-12-04T13:44:25.9848400Z [rank3]:[W1204 13:32:04.861348544 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9848570Z [rank2]:[W1204 13:32:04.897675205 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9848746Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9849004Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9849167Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9849563Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9849764Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9849881Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9849977Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9850073Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9850089Z 2025-12-04T13:44:25.9850323Z [rank2]:[W1204 13:32:04.899780468 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9850493Z [rank1]:[W1204 13:32:05.484284449 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9850669Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9850925Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9851088Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9851460Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9851662Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9851768Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9851862Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9851958Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9851960Z 2025-12-04T13:44:25.9852193Z [rank1]:[W1204 13:32:05.486671006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9852365Z [rank3]:[W1204 13:32:05.861517859 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9852539Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9852798Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9852960Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9853339Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9853554Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9853668Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9853764Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9853859Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9853860Z 2025-12-04T13:44:25.9854092Z [rank3]:[W1204 13:32:05.863928476 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9854273Z [rank2]:[W1204 13:32:05.899915534 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9854447Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9854702Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9854865Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9855233Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9855437Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9855542Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9855640Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9855735Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9855737Z 2025-12-04T13:44:25.9855970Z [rank2]:[W1204 13:32:05.902345971 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9856139Z [rank1]:[W1204 13:32:06.486858921 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9856315Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9856569Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9856734Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9857100Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9857324Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9857430Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9857565Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9857674Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9857675Z 2025-12-04T13:44:25.9857907Z [rank1]:[W1204 13:32:06.489239129 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9858092Z [rank3]:[W1204 13:32:06.864064163 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9858267Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9858523Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9858688Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9862808Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9863021Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9863128Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9863226Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9863323Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9863325Z 2025-12-04T13:44:25.9863561Z [rank3]:[W1204 13:32:06.866323823 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9863732Z [rank2]:[W1204 13:32:06.902459468 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9863907Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9864165Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9864330Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9864699Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9864902Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9865071Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9865167Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9865263Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9865277Z 2025-12-04T13:44:25.9865512Z [rank2]:[W1204 13:32:06.904672469 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9865682Z [rank1]:[W1204 13:32:07.489414995 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9865873Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9866129Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9866293Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9866662Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9866864Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9866969Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9867065Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9867162Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9867164Z 2025-12-04T13:44:25.9867401Z [rank1]:[W1204 13:32:07.491191706 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9867625Z [rank3]:[W1204 13:32:07.866496679 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9867800Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9868058Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9868221Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9868588Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9868791Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9868896Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9869021Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9869117Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9869119Z 2025-12-04T13:44:25.9869353Z [rank3]:[W1204 13:32:07.868905336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9869540Z [rank2]:[W1204 13:32:07.904790606 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9869713Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9869990Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9870155Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9870524Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9870727Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9870833Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9870930Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9871027Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9871029Z 2025-12-04T13:44:25.9871262Z [rank2]:[W1204 13:32:07.906738724 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9871433Z [rank1]:[W1204 13:32:08.491380882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9871607Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9871863Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9872026Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9872395Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9872599Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9872705Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9872799Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9872905Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9872916Z 2025-12-04T13:44:25.9873148Z [rank1]:[W1204 13:32:08.493366068 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9873329Z [rank3]:[W1204 13:32:08.869102331 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9873502Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9873759Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9873933Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9874299Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9874501Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9874604Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9874702Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9874798Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9874801Z 2025-12-04T13:44:25.9875035Z [rank3]:[W1204 13:32:08.870995220 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9875205Z [rank2]:[W1204 13:32:08.906896920 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9875378Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9875634Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9875798Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9876167Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9876368Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9876473Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9876569Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9876666Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9876668Z 2025-12-04T13:44:25.9876922Z [rank2]:[W1204 13:32:08.908724230 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9877092Z [rank1]:[W1204 13:32:09.493558884 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9877277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9877556Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9877737Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9878102Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9878304Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9878409Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9878503Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9878600Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9878602Z 2025-12-04T13:44:25.9878835Z [rank1]:[W1204 13:32:09.495920382 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9879011Z [rank3]:[W1204 13:32:09.871188936 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9879185Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9879441Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9879605Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9879973Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9880175Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9880279Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9880374Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9880469Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9880472Z 2025-12-04T13:44:25.9880725Z [rank3]:[W1204 13:32:09.873340848 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9880910Z [rank2]:[W1204 13:32:09.908883916 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9881102Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9881362Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9881525Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9881903Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9882104Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9882208Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9882304Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9882400Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9882403Z 2025-12-04T13:44:25.9882640Z [rank2]:[W1204 13:32:09.910560050 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9882810Z [rank1]:[W1204 13:32:10.496095878 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9882985Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9883244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9883410Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9883780Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9883979Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9884084Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9884178Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9884276Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9884278Z 2025-12-04T13:44:25.9884511Z [rank1]:[W1204 13:32:10.497371460 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9884700Z [rank3]:[W1204 13:32:10.873508135 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9884874Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9885145Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9885309Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9885675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9885888Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9885992Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9886088Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9886183Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9886186Z 2025-12-04T13:44:25.9886420Z [rank3]:[W1204 13:32:10.874744498 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9886591Z [rank2]:[W1204 13:32:10.910675427 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9886766Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9887023Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9887186Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9887581Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9887786Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9887889Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9887986Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9888081Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9888083Z 2025-12-04T13:44:25.9888317Z [rank2]:[W1204 13:32:10.911899481 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9888487Z [rank1]:[W1204 13:32:11.497521897 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9888686Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9888942Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9889119Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9889489Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9889707Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9889811Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9889906Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9890003Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9890004Z 2025-12-04T13:44:25.9890236Z [rank1]:[W1204 13:32:11.498767690 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9890408Z [rank3]:[W1204 13:32:11.874943644 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9890583Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9890839Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9891002Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9891367Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9891573Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9891678Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9891774Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9891870Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9891873Z 2025-12-04T13:44:25.9892105Z [rank3]:[W1204 13:32:11.876726255 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9892277Z [rank2]:[W1204 13:32:11.912074767 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9892453Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9892729Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9892901Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9893269Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9893487Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9893593Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9893688Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9893784Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9893787Z 2025-12-04T13:44:25.9894021Z [rank2]:[W1204 13:32:11.913563705 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9894190Z [rank1]:[W1204 13:32:12.498922367 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9894364Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9894619Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9894782Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9895151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9895351Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9895456Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9895552Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9895648Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9895651Z 2025-12-04T13:44:25.9895885Z [rank1]:[W1204 13:32:12.500156470 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9896056Z [rank3]:[W1204 13:32:12.876921072 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9896234Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9896506Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9896669Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9897043Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9897245Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9897360Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9897457Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9897588Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9897591Z 2025-12-04T13:44:25.9897823Z [rank3]:[W1204 13:32:12.879101874 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9897994Z [rank2]:[W1204 13:32:12.913699173 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9898168Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9898426Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9898589Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9898956Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9899159Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9899264Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9899361Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9899458Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9899460Z 2025-12-04T13:44:25.9899693Z [rank2]:[W1204 13:32:12.915914214 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9899863Z [rank1]:[W1204 13:32:13.500526003 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9900039Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9900296Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9900492Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9900860Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9901073Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9901178Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9901285Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9901382Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9901385Z 2025-12-04T13:44:25.9901618Z [rank1]:[W1204 13:32:13.501971251 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9901790Z [rank3]:[W1204 13:32:13.879253152 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9901964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9902218Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9902383Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9902752Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9902955Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9903058Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9903155Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9903252Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9903254Z 2025-12-04T13:44:25.9903489Z [rank3]:[W1204 13:32:13.881506962 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9903660Z [rank2]:[W1204 13:32:13.916052442 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9903833Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9904089Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9904253Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9904641Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9904855Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9904959Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9905055Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9905164Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9905166Z 2025-12-04T13:44:25.9905401Z [rank2]:[W1204 13:32:13.918655785 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9905571Z [rank1]:[W1204 13:32:14.502126519 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9905748Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9906002Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9906165Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9906533Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9906734Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9906839Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9906934Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9907031Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9907033Z 2025-12-04T13:44:25.9907268Z [rank1]:[W1204 13:32:14.503386782 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9907439Z [rank3]:[W1204 13:32:14.881660830 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9907647Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9907901Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9908064Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9908461Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9908663Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9908789Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9908885Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9908981Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9908996Z 2025-12-04T13:44:25.9909232Z [rank3]:[W1204 13:32:14.883781953 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9909403Z [rank2]:[W1204 13:32:14.918907211 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9909577Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9909833Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9909996Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9910365Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9910567Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9910671Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9910767Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9910863Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9910866Z 2025-12-04T13:44:25.9911099Z [rank2]:[W1204 13:32:14.921398956 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9911270Z [rank1]:[W1204 13:32:15.503494211 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9911445Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9911699Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9911863Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9912253Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9912454Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9912569Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9912662Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9912758Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9912760Z 2025-12-04T13:44:25.9912991Z [rank1]:[W1204 13:32:15.504662875 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9913173Z [rank3]:[W1204 13:32:15.883940602 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9913349Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9913607Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9913770Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9914137Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9914339Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9914443Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9914539Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9914635Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9914637Z 2025-12-04T13:44:25.9914868Z [rank3]:[W1204 13:32:15.885329681 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9915039Z [rank2]:[W1204 13:32:15.921543474 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9915212Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9915472Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9915637Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9916003Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9916230Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9916334Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9916444Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9916539Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9916541Z 2025-12-04T13:44:25.9916774Z [rank2]:[W1204 13:32:15.922793337 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9916957Z [rank1]:[W1204 13:32:16.504819703 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9917132Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9917387Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9917593Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9917965Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9918169Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9918273Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9918367Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9918465Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9918467Z 2025-12-04T13:44:25.9918699Z [rank1]:[W1204 13:32:16.506444368 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9918872Z [rank3]:[W1204 13:32:16.885507059 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9919048Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9919303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9919467Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9919833Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9920051Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9920170Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9920265Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9920373Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9920375Z 2025-12-04T13:44:25.9920607Z [rank3]:[W1204 13:32:16.887724740 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9920777Z [rank2]:[W1204 13:32:16.922939126 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9920964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9921221Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9921383Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9921751Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9921954Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9922059Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9922155Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9922252Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9922254Z 2025-12-04T13:44:25.9922489Z [rank2]:[W1204 13:32:16.924157649 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9922658Z [rank1]:[W1204 13:32:17.506596106 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9922834Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9923091Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9923251Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9923620Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9923820Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9923926Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9924046Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9924143Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9924157Z 2025-12-04T13:44:25.9924396Z [rank1]:[W1204 13:32:17.508213251 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9924565Z [rank3]:[W1204 13:32:17.887866649 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9924740Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9925007Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9925171Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9925538Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9925739Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9925845Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9925941Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9926040Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9926042Z 2025-12-04T13:44:25.9926275Z [rank3]:[W1204 13:32:17.889591431 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9926448Z [rank2]:[W1204 13:32:17.924275368 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9926623Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9926880Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9927045Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9927416Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9927665Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9927770Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9927866Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9927990Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9927992Z 2025-12-04T13:44:25.9928226Z [rank2]:[W1204 13:32:17.925626619 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9928408Z [rank1]:[W1204 13:32:18.508397749 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9928583Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9928853Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9929016Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9929382Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9929583Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9929688Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9929784Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9929882Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9929884Z 2025-12-04T13:44:25.9930116Z [rank1]:[W1204 13:32:18.510567301 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9930286Z [rank3]:[W1204 13:32:18.889762110 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9930461Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9930716Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9930881Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9931249Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9931450Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9931554Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9931650Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9931747Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9931749Z 2025-12-04T13:44:25.9932003Z [rank3]:[W1204 13:32:18.891714197 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9932175Z [rank2]:[W1204 13:32:18.925805567 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9932361Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9932616Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9932792Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9933161Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9933364Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9933468Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9933564Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9933660Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9933662Z 2025-12-04T13:44:25.9933897Z [rank2]:[W1204 13:32:18.927851412 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9934067Z [rank1]:[W1204 13:32:19.510722141 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9934242Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9934498Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9934662Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9935032Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9935234Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9935339Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9935434Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9935531Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9935533Z 2025-12-04T13:44:25.9935790Z [rank1]:[W1204 13:32:19.512682197 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9935960Z [rank3]:[W1204 13:32:19.891875486 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9936145Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9936398Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9936560Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9936936Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9937140Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9937246Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9937341Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9937436Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9937439Z 2025-12-04T13:44:25.9937708Z [rank3]:[W1204 13:32:19.894041768 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9937878Z [rank2]:[W1204 13:32:19.927969212 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9938051Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9938308Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9938471Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9938839Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9939041Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9939147Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9939243Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9939340Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9939343Z 2025-12-04T13:44:25.9939576Z [rank2]:[W1204 13:32:19.929356812 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9939777Z [rank1]:[W1204 13:32:20.512860806 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9939951Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9940221Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9940382Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9940763Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9940964Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9941074Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9941169Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9941266Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9941268Z 2025-12-04T13:44:25.9941506Z [rank1]:[W1204 13:32:20.514560539 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9941677Z [rank3]:[W1204 13:32:20.894209688 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9941851Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9942106Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9942269Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9942634Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9942838Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9942942Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9943038Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9943134Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9943136Z 2025-12-04T13:44:25.9943369Z [rank3]:[W1204 13:32:20.896583816 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9943555Z [rank2]:[W1204 13:32:20.929488772 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9943743Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9944000Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9944175Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9944542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9944757Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9944860Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9944957Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9945054Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9945055Z 2025-12-04T13:44:25.9945289Z [rank2]:[W1204 13:32:20.931565526 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9945459Z [rank1]:[W1204 13:32:21.514731968 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9945634Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9945891Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9946054Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9946420Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9946623Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9946728Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9946824Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9946920Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9946921Z 2025-12-04T13:44:25.9947155Z [rank1]:[W1204 13:32:21.516947770 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9947324Z [rank3]:[W1204 13:32:21.896751105 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9947533Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9947802Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9947980Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9948354Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9948568Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9948673Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9948768Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9948865Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9948867Z 2025-12-04T13:44:25.9949101Z [rank3]:[W1204 13:32:21.898573365 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9949270Z [rank2]:[W1204 13:32:21.931695676 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9949445Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9949701Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9949864Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9950237Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9950440Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9950545Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9950643Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9950739Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9950742Z 2025-12-04T13:44:25.9950975Z [rank2]:[W1204 13:32:21.933448588 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9951146Z [rank1]:[W1204 13:32:22.517106779 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9951320Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9951594Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9951756Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9952133Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9952335Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9952457Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9952553Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9952648Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9952650Z 2025-12-04T13:44:25.9952883Z [rank1]:[W1204 13:32:22.519267302 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9953051Z [rank3]:[W1204 13:32:22.898731105 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9953225Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9953483Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9953646Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9954013Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9954212Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9954322Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9954418Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9954516Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9954518Z 2025-12-04T13:44:25.9954753Z [rank3]:[W1204 13:32:22.899964858 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9954923Z [rank2]:[W1204 13:32:22.933617238 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9955096Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9955361Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9955534Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9955900Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9956113Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9956228Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9956324Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9956422Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9956424Z 2025-12-04T13:44:25.9956661Z [rank2]:[W1204 13:32:22.935324990 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9956834Z [rank1]:[W1204 13:32:23.519455001 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9957008Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9957263Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9957426Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9957818Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9958022Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9958127Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9958223Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9958320Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9958322Z 2025-12-04T13:44:25.9958556Z [rank1]:[W1204 13:32:23.521063916 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9958728Z [rank3]:[W1204 13:32:23.900118088 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9958902Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9959159Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9959345Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9959732Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9959945Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9960050Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9960165Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9960261Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9960263Z 2025-12-04T13:44:25.9960495Z [rank3]:[W1204 13:32:23.902131384 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9960666Z [rank2]:[W1204 13:32:23.935488860 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9960841Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9961099Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9961266Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9961633Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9961836Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9961940Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9962036Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9962133Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9962135Z 2025-12-04T13:44:25.9962370Z [rank2]:[W1204 13:32:24.937261721 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9962541Z [rank1]:[W1204 13:32:24.521201077 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9962715Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9962973Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9963137Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9963536Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9963753Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9963858Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9963953Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9964063Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9964065Z 2025-12-04T13:44:25.9964303Z [rank1]:[W1204 13:32:24.523439317 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9964473Z [rank3]:[W1204 13:32:24.902289304 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9964648Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9964904Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9965068Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9965437Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9965639Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9965746Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9965840Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9965936Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9965939Z 2025-12-04T13:44:25.9966174Z [rank3]:[W1204 13:32:24.903919528 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9966346Z [rank2]:[W1204 13:32:25.937433591 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9966520Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9966779Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9966942Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9967331Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9967565Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9967711Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9967808Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9967904Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9967907Z 2025-12-04T13:44:25.9968161Z [rank2]:[W1204 13:32:25.938912449 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9968334Z [rank1]:[W1204 13:32:25.523600178 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9968508Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9968764Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9968926Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9969296Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9969498Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9969604Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9969700Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9969795Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9969797Z 2025-12-04T13:44:25.9970036Z [rank1]:[W1204 13:32:25.525379209 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9970209Z [rank3]:[W1204 13:32:25.904248065 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9970384Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9970641Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9970805Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9971169Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9972507Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9972614Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9972709Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9972806Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9972807Z 2025-12-04T13:44:25.9973039Z [rank3]:[W1204 13:32:25.905637825 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9973222Z [rank2]:[W1204 13:32:26.939080449 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9973398Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9973669Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9973833Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9974201Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9974407Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9974512Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9974607Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9974704Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9974706Z 2025-12-04T13:44:25.9974938Z [rank2]:[W1204 13:32:26.940998017 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9975109Z [rank1]:[W1204 13:32:26.525562059 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9975284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9975543Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9975706Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9976072Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9976298Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9976454Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9976551Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9976646Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9976650Z 2025-12-04T13:44:25.9976881Z [rank1]:[W1204 13:32:26.527672913 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9977051Z [rank3]:[W1204 13:32:26.905834045 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9977239Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9977623Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9977786Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9978154Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9978357Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9978463Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9978560Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9978656Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9978657Z 2025-12-04T13:44:25.9978890Z [rank3]:[W1204 13:32:26.907381731 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9979059Z [rank2]:[W1204 13:32:27.941127799 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9979236Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9979493Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9979658Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9980029Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9980232Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9980356Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9980483Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9980581Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9980583Z 2025-12-04T13:44:25.9980815Z [rank2]:[W1204 13:32:27.943383589 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9980986Z [rank1]:[W1204 13:32:27.528125257 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9981181Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9981438Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9981602Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9981969Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9982172Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9982280Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9982377Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9982473Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9982476Z 2025-12-04T13:44:25.9982707Z [rank1]:[W1204 13:32:27.530564804 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9982877Z [rank3]:[W1204 13:32:27.907541712 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9983050Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9983310Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9983472Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9983844Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9984051Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9984158Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9984266Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9984391Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9984393Z 2025-12-04T13:44:25.9984627Z [rank3]:[W1204 13:32:27.908763355 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9984795Z [rank2]:[W1204 13:32:28.943537780 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9984970Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9985239Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9985404Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9985770Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9985971Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9986078Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9986173Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9986272Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9986275Z 2025-12-04T13:44:25.9986507Z [rank2]:[W1204 13:32:28.945286352 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9986677Z [rank1]:[W1204 13:32:28.530741425 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9986852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9987105Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9987270Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9987685Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9987885Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9987990Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9988086Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9988196Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9988217Z 2025-12-04T13:44:25.9988467Z [rank1]:[W1204 13:32:28.532566984 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9988637Z [rank3]:[W1204 13:32:28.908903817 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9988811Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9989068Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9989245Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9989611Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9989818Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9989923Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9990020Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9990116Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9990118Z 2025-12-04T13:44:25.9990352Z [rank3]:[W1204 13:32:28.910096471 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9990525Z [rank2]:[W1204 13:32:29.945434623 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9990700Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9990957Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9991124Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9991494Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9991695Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9991800Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9991896Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9991993Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9991994Z 2025-12-04T13:44:25.9992264Z [rank2]:[W1204 13:32:29.947278413 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9992448Z [rank1]:[W1204 13:32:29.532746286 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9992621Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:25.9992882Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9993057Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9993423Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9993626Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9993731Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9993826Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9993922Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9993925Z 2025-12-04T13:44:25.9994161Z [rank1]:[W1204 13:32:29.534178854 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9994330Z [rank3]:[W1204 13:32:29.910275922 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9994504Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9994760Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9994923Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9995291Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9995492Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9995598Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9995694Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9995789Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9995792Z 2025-12-04T13:44:25.9996041Z [rank3]:[W1204 13:32:29.912089472 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9996234Z [rank2]:[W1204 13:32:30.947475164 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9996408Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9996663Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9996826Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9997206Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9997411Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9997556Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9997654Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9997750Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9997752Z 2025-12-04T13:44:25.9997988Z [rank2]:[W1204 13:32:30.949462160 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:25.9998161Z [rank1]:[W1204 13:32:30.534339206 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:25.9998336Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:25.9998590Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:25.9998751Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9999119Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:25.9999322Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:25.9999429Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:25.9999524Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9999620Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:25.9999622Z 2025-12-04T13:44:25.9999855Z [rank1]:[W1204 13:32:30.536432280 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0000065Z [rank3]:[W1204 13:32:30.912283853 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0000268Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0000525Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0000689Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0001056Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0001277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0001383Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0001477Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0001574Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0001576Z 2025-12-04T13:44:26.0001807Z [rank3]:[W1204 13:32:30.914282309 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0001979Z [rank2]:[W1204 13:32:31.949638101 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0002153Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0002413Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0002577Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0002942Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0003146Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0003252Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0003348Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0003445Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0003447Z 2025-12-04T13:44:26.0003680Z [rank2]:[W1204 13:32:31.951679397 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0003851Z [rank1]:[W1204 13:32:31.536597602 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0004050Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0004315Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0004477Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0004845Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0005065Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0005172Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0005267Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0005362Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0005364Z 2025-12-04T13:44:26.0005599Z [rank1]:[W1204 13:32:31.539069387 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0005770Z [rank3]:[W1204 13:32:31.914456821 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0005946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0006202Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0006366Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0006736Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0006937Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0007043Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0007139Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0007236Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0007238Z 2025-12-04T13:44:26.0007520Z [rank3]:[W1204 13:32:31.916642643 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0007693Z [rank2]:[W1204 13:32:32.951851619 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0007868Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0008180Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0008356Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0008721Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0008939Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0009044Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0009142Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0009240Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0009242Z 2025-12-04T13:44:26.0009473Z [rank2]:[W1204 13:32:32.953686318 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0009643Z [rank1]:[W1204 13:32:32.539233650 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0009818Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0010079Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0010241Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0010606Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0010806Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0010912Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0011008Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0011104Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0011106Z 2025-12-04T13:44:26.0011338Z [rank1]:[W1204 13:32:32.540939032 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0011506Z [rank3]:[W1204 13:32:32.916762046 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0011681Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0011968Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0012148Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0012517Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0012717Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0012843Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0012939Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0013037Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0013039Z 2025-12-04T13:44:26.0013271Z [rank3]:[W1204 13:32:32.919020177 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0013441Z [rank2]:[W1204 13:32:33.953836161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0013614Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0013870Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0014034Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0014404Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0014606Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0014711Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0014808Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0014906Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0014909Z 2025-12-04T13:44:26.0015143Z [rank2]:[W1204 13:32:33.955923845 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0015318Z [rank1]:[W1204 13:32:33.541078855 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0015492Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0015749Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0015935Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0016314Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0016516Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0016621Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0016728Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0016824Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0016830Z 2025-12-04T13:44:26.0017063Z [rank1]:[W1204 13:32:33.543145160 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0017232Z [rank3]:[W1204 13:32:33.919205859 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0017407Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0017705Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0017870Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0018236Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0018436Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0018541Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0018636Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0018734Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0018736Z 2025-12-04T13:44:26.0018972Z [rank3]:[W1204 13:32:33.921101957 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0019144Z [rank2]:[W1204 13:32:34.956087767 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0019319Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0019576Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0019743Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0020138Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0020355Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0020459Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0020556Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0020668Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0020670Z 2025-12-04T13:44:26.0020907Z [rank2]:[W1204 13:32:34.958305959 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0021079Z [rank1]:[W1204 13:32:34.543312562 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0021251Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0021505Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0021667Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0022034Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0022236Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0022340Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0022436Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0022533Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0022536Z 2025-12-04T13:44:26.0022768Z [rank1]:[W1204 13:32:34.545638821 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0022940Z [rank3]:[W1204 13:32:34.921250590 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0023114Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0023368Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0023531Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0023922Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0024132Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0024237Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0024331Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0024428Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0024442Z 2025-12-04T13:44:26.0024674Z [rank3]:[W1204 13:32:34.922606980 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0024847Z [rank2]:[W1204 13:32:35.958482441 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0025021Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0025280Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0025445Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0025814Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0026015Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0026119Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0026217Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0026313Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0026316Z 2025-12-04T13:44:26.0026549Z [rank2]:[W1204 13:32:35.960559596 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0026722Z [rank1]:[W1204 13:32:35.545815714 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0026896Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0027152Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0027313Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0027755Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0028007Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0028111Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0028209Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0028304Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0028305Z 2025-12-04T13:44:26.0028541Z [rank1]:[W1204 13:32:35.548055404 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0028725Z [rank3]:[W1204 13:32:35.922769253 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0028902Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0029156Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0029319Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0029693Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0029897Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0030005Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0030099Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0030195Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0030197Z 2025-12-04T13:44:26.0030428Z [rank3]:[W1204 13:32:35.925019814 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0030600Z [rank2]:[W1204 13:32:36.960698969 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0030778Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0031032Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0031194Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0031560Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0031790Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0031908Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0032005Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0032101Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0032104Z 2025-12-04T13:44:26.0032337Z [rank2]:[W1204 13:32:36.962725074 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0032519Z [rank1]:[W1204 13:32:36.548245797 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0032694Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0032950Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0033111Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0033479Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0033683Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0033789Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0033884Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0033980Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0033982Z 2025-12-04T13:44:26.0034214Z [rank1]:[W1204 13:32:36.550748692 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0034384Z [rank3]:[W1204 13:32:36.925190067 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0034561Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0034817Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0034982Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0035348Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0035549Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0035683Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0035788Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0035883Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0035885Z 2025-12-04T13:44:26.0036119Z [rank3]:[W1204 13:32:36.927250942 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0036291Z [rank2]:[W1204 13:32:37.962928827 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0036480Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0036737Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0036902Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0037267Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0037512Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0037618Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0037715Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0037813Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0037815Z 2025-12-04T13:44:26.0038048Z [rank2]:[W1204 13:32:37.965072780 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0038219Z [rank1]:[W1204 13:32:37.550893256 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0038393Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0038649Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0038812Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0039177Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0039379Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0039483Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0039618Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0039734Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0039736Z 2025-12-04T13:44:26.0039968Z [rank1]:[W1204 13:32:37.552817313 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0040136Z [rank3]:[W1204 13:32:37.927406615 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0040311Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0040583Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0040749Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0041115Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0041317Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0041422Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0041517Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0041615Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0041617Z 2025-12-04T13:44:26.0041849Z [rank3]:[W1204 13:32:37.929144967 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0042018Z [rank2]:[W1204 13:32:38.965497617 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0042191Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0042446Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0042610Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0042979Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0043180Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0043284Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0043381Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0043491Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0043520Z 2025-12-04T13:44:26.0043753Z [rank2]:[W1204 13:32:38.967422875 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0043923Z [rank1]:[W1204 13:32:38.553005727 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0044097Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0044352Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0044528Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0044897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0045100Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0045205Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0045302Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0045398Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0045401Z 2025-12-04T13:44:26.0045635Z [rank1]:[W1204 13:32:38.555248927 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0045805Z [rank3]:[W1204 13:32:38.929287541 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0045981Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0046236Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0046400Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0046766Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0046964Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0047070Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0047167Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0047266Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0047268Z 2025-12-04T13:44:26.0047557Z [rank3]:[W1204 13:32:38.930561363 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0047744Z [rank2]:[W1204 13:32:39.967577629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0047917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0048171Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0048350Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0048717Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0048921Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0049025Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0049123Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0049223Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0049225Z 2025-12-04T13:44:26.0049463Z [rank2]:[W1204 13:32:39.969107206 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0049635Z [rank1]:[W1204 13:32:39.555410961 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0049808Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0050064Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0050227Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0050594Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0050797Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0050901Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0050996Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0051091Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0051094Z 2025-12-04T13:44:26.0051350Z [rank1]:[W1204 13:32:39.556648904 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0051545Z [rank3]:[W1204 13:32:39.930923613 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0051722Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0051977Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0052139Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0052520Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0052722Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0052827Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0052923Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0053020Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0053023Z 2025-12-04T13:44:26.0053256Z [rank3]:[W1204 13:32:39.933391579 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0053429Z [rank2]:[W1204 13:32:40.969289159 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0053605Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0053862Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0054025Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0054393Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0054595Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0054700Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0054797Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0054895Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0054896Z 2025-12-04T13:44:26.0055129Z [rank2]:[W1204 13:32:40.971440682 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0055322Z [rank1]:[W1204 13:32:40.556818378 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0055505Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0055759Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0055921Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0056290Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0056503Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0056606Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0056704Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0056798Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0056800Z 2025-12-04T13:44:26.0057031Z [rank1]:[W1204 13:32:40.558963451 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0057203Z [rank3]:[W1204 13:32:40.933539853 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0057378Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0057683Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0057845Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0058218Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0058421Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0058526Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0058622Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0058719Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0058721Z 2025-12-04T13:44:26.0058954Z [rank3]:[W1204 13:32:40.934774256 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0059126Z [rank2]:[W1204 13:32:41.971576847 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0059335Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0059603Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0059766Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0060131Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0060354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0060460Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0060556Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0060652Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0060654Z 2025-12-04T13:44:26.0060885Z [rank2]:[W1204 13:32:41.973630471 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0061055Z [rank1]:[W1204 13:32:41.559219413 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0061230Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0061486Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0061647Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0062014Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0062217Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0062322Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0062419Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0062514Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0062516Z 2025-12-04T13:44:26.0062750Z [rank1]:[W1204 13:32:41.562607389 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0062918Z [rank3]:[W1204 13:32:41.934968840 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0063093Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0063369Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0063541Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0063905Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0064117Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0064224Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0064320Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0064416Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0064418Z 2025-12-04T13:44:26.0064655Z [rank3]:[W1204 13:32:42.937346357 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0064825Z [rank2]:[W1204 13:32:42.973769346 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0065001Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0065255Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0065419Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0065783Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0065986Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0066090Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0066187Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0066284Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0066285Z 2025-12-04T13:44:26.0066516Z [rank2]:[W1204 13:32:42.975739023 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0066687Z [rank1]:[W1204 13:32:42.562802203 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0066863Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0067139Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0067312Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0067718Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0067920Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0068037Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0068134Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0068230Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0068232Z 2025-12-04T13:44:26.0068464Z [rank1]:[W1204 13:32:42.565020344 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0068632Z [rank3]:[W1204 13:32:43.937461483 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0068807Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0069066Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0069230Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0069594Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0069794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0069899Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0069994Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0070092Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0070095Z 2025-12-04T13:44:26.0070327Z [rank3]:[W1204 13:32:43.939430790 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0070496Z [rank2]:[W1204 13:32:43.975886368 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0070670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0070925Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0071113Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0071494Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0071696Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0071801Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0071912Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0072010Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0072013Z 2025-12-04T13:44:26.0072245Z [rank2]:[W1204 13:32:43.978510210 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0072416Z [rank1]:[W1204 13:32:43.565206828 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0072590Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0072845Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0073009Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0073378Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0073580Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0073683Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0073781Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0073876Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0073879Z 2025-12-04T13:44:26.0074113Z [rank1]:[W1204 13:32:43.567390800 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0074283Z [rank3]:[W1204 13:32:44.939515727 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0074458Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0074713Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0074889Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0075277Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0075476Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0075581Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0075676Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0075783Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0075785Z 2025-12-04T13:44:26.0076020Z [rank3]:[W1204 13:32:44.942258916 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0076190Z [rank2]:[W1204 13:32:44.978646626 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0076364Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0076618Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0076782Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0077147Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0077350Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0077461Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0077585Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0077683Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0077685Z 2025-12-04T13:44:26.0077919Z [rank2]:[W1204 13:32:44.980617513 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0078090Z [rank1]:[W1204 13:32:44.567568785 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0078263Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0078516Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0078679Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0079069Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0079283Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0079387Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0079483Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0079577Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0079593Z 2025-12-04T13:44:26.0079827Z [rank1]:[W1204 13:32:44.569822145 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0080001Z [rank3]:[W1204 13:32:45.942392642 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0080178Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0080432Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0080592Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0080960Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0081162Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0081266Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0081361Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0081457Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0081460Z 2025-12-04T13:44:26.0081692Z [rank3]:[W1204 13:32:45.944800359 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0081864Z [rank2]:[W1204 13:32:45.980784238 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0082039Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0082294Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0082456Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0082841Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0083053Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0083158Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0083253Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0083351Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0083352Z 2025-12-04T13:44:26.0083585Z [rank2]:[W1204 13:32:45.982768614 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0083768Z [rank1]:[W1204 13:32:45.569994311 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0083942Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0084196Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0084360Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0084727Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0084932Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0085036Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0085132Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0085227Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0085229Z 2025-12-04T13:44:26.0085461Z [rank1]:[W1204 13:32:45.571678964 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0085632Z [rank3]:[W1204 13:32:46.944981214 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0085808Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0086065Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0086225Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0086594Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0086820Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0086947Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0087041Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0087136Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0087138Z 2025-12-04T13:44:26.0087370Z [rank3]:[W1204 13:32:46.946971110 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0087613Z [rank2]:[W1204 13:32:46.982917080 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0087790Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0088045Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0088208Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0088573Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0088779Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0088884Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0088979Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0089076Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0089078Z 2025-12-04T13:44:26.0089308Z [rank2]:[W1204 13:32:46.984758119 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0089480Z [rank1]:[W1204 13:32:46.571837969 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0089654Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0089909Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0090071Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0090437Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0090655Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0090785Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0090882Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0090977Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0090979Z 2025-12-04T13:44:26.0091214Z [rank1]:[W1204 13:32:46.574097340 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0091385Z [rank3]:[W1204 13:32:47.947158856 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0091573Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0091829Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0091991Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0092356Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0092557Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0092663Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0092758Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0092854Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0092857Z 2025-12-04T13:44:26.0093092Z [rank3]:[W1204 13:32:47.948982295 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0093261Z [rank2]:[W1204 13:32:47.984937785 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0093435Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0093689Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0093852Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0094217Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0094420Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0094535Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0094649Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0094747Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0094749Z 2025-12-04T13:44:26.0094980Z [rank2]:[W1204 13:32:47.987166686 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0095151Z [rank1]:[W1204 13:32:47.574290235 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0095326Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0095593Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0095758Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0096124Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0096325Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0096431Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0096528Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0096624Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0096626Z 2025-12-04T13:44:26.0096861Z [rank1]:[W1204 13:32:47.576112415 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0097033Z [rank3]:[W1204 13:32:48.949112522 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0097206Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0097466Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0097676Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0098042Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0098243Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0098351Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0098446Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0098571Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0098587Z 2025-12-04T13:44:26.0098819Z [rank3]:[W1204 13:32:48.951180637 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0098991Z [rank2]:[W1204 13:32:48.987341541 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0099167Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0099436Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0099602Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0099974Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0100176Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0100280Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0100375Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0100473Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0100477Z 2025-12-04T13:44:26.0100708Z [rank2]:[W1204 13:32:48.989580972 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0100880Z [rank1]:[W1204 13:32:48.576282930 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0101052Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0101308Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0101474Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0101842Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0102046Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0102150Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0102246Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0102341Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0102343Z 2025-12-04T13:44:26.0102608Z [rank1]:[W1204 13:32:48.578101640 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0102791Z [rank3]:[W1204 13:32:49.951352393 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0102965Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0103220Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0103393Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0103763Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0103964Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0104069Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0104165Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0104261Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0104263Z 2025-12-04T13:44:26.0104500Z [rank3]:[W1204 13:32:49.953716671 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0104670Z [rank2]:[W1204 13:32:49.989755238 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0104844Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0105101Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0105265Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0105635Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0105836Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0105941Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0106036Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0106135Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0106137Z 2025-12-04T13:44:26.0106390Z [rank2]:[W1204 13:32:49.991812163 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0106570Z [rank1]:[W1204 13:32:49.578268177 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0106744Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0106999Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0107162Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0107577Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0107780Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0107883Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0107979Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0108075Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0108078Z 2025-12-04T13:44:26.0108312Z [rank1]:[W1204 13:32:49.580008369 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0108487Z [rank3]:[W1204 13:32:50.953886517 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0108660Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0108915Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0109078Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0109445Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0109648Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0109752Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0109848Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0109943Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0109946Z 2025-12-04T13:44:26.0110178Z [rank3]:[W1204 13:32:50.956300484 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0110372Z [rank2]:[W1204 13:32:50.991995159 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0110560Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0110816Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0115476Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0115889Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0116095Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0116203Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0116300Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0116398Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0116400Z 2025-12-04T13:44:26.0116636Z [rank2]:[W1204 13:32:50.994251639 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0116818Z [rank1]:[W1204 13:32:50.580417910 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0116995Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0117253Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0117419Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0117834Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0118040Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0118144Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0118240Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0118335Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0118337Z 2025-12-04T13:44:26.0118570Z [rank1]:[W1204 13:32:50.582460935 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0118761Z [rank3]:[W1204 13:32:51.956471480 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0118964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0119222Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0119384Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0119752Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0119971Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0120076Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0120172Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0120266Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0120268Z 2025-12-04T13:44:26.0120501Z [rank3]:[W1204 13:32:51.958600124 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0120671Z [rank2]:[W1204 13:32:51.994428256 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0120847Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0121103Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0121266Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0121631Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0121835Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0121941Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0122036Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0122133Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0122135Z 2025-12-04T13:44:26.0122367Z [rank2]:[W1204 13:32:51.996413482 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0122537Z [rank1]:[W1204 13:32:51.582615142 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0122727Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0123003Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0123165Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0123530Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0123743Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0123848Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0123946Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0124041Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0124043Z 2025-12-04T13:44:26.0124276Z [rank1]:[W1204 13:32:51.583856784 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0124446Z [rank3]:[W1204 13:32:52.958798970 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0124620Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0124878Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0125042Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0125410Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0125612Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0125717Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0125813Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0125908Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0125910Z 2025-12-04T13:44:26.0126145Z [rank3]:[W1204 13:32:52.961042490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0126314Z [rank2]:[W1204 13:32:52.996794604 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0126490Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0126761Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0126935Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0127307Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0127539Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0127658Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0127755Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0127853Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0127855Z 2025-12-04T13:44:26.0128086Z [rank2]:[W1204 13:32:52.998946427 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0128257Z [rank1]:[W1204 13:32:52.583997732 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0128430Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0128687Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0128851Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0129214Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0129416Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0129519Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0129615Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0129711Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0129713Z 2025-12-04T13:44:26.0129945Z [rank1]:[W1204 13:32:52.585251104 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0130116Z [rank3]:[W1204 13:32:53.961190858 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0130289Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0130560Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0130748Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0131114Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0131315Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0131429Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0131525Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0131621Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0131624Z 2025-12-04T13:44:26.0131857Z [rank3]:[W1204 13:32:53.962796522 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0132027Z [rank2]:[W1204 13:32:53.999095144 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0132201Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0132457Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0132622Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0132991Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0133192Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0133296Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0133392Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0133491Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0133494Z 2025-12-04T13:44:26.0133727Z [rank2]:[W1204 13:32:53.001353704 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0133897Z [rank1]:[W1204 13:32:53.585410442 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0134070Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0134324Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0134497Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0134886Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0135087Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0135190Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0135302Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0135397Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0135399Z 2025-12-04T13:44:26.0135632Z [rank1]:[W1204 13:32:53.586657494 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0135803Z [rank3]:[W1204 13:32:54.962961630 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0135976Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0136232Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0136395Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0136762Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0136966Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0137070Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0137166Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0137263Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0137265Z 2025-12-04T13:44:26.0137543Z [rank3]:[W1204 13:32:54.965292358 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0137712Z [rank2]:[W1204 13:32:54.001520412 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0137885Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0138139Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0138303Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0138697Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0138908Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0139013Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0139109Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0139219Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0139220Z 2025-12-04T13:44:26.0139455Z [rank2]:[W1204 13:32:54.003439410 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0139626Z [rank1]:[W1204 13:32:54.586808712 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0139800Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0140053Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0140217Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0140585Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0140786Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0140889Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0140983Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0141079Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0141083Z 2025-12-04T13:44:26.0141316Z [rank1]:[W1204 13:32:54.588274840 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0141490Z [rank3]:[W1204 13:32:55.965477725 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0141666Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0141923Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0142086Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0142473Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0142685Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0142788Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0142883Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0142978Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0142981Z 2025-12-04T13:44:26.0143223Z [rank3]:[W1204 13:32:55.967240787 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0143394Z [rank2]:[W1204 13:32:55.003579288 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0143570Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0143826Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0143989Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0144359Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0144561Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0144665Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0144760Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0144857Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0144859Z 2025-12-04T13:44:26.0145091Z [rank2]:[W1204 13:32:55.005737940 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0145263Z [rank1]:[W1204 13:32:55.588434517 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0145438Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0145694Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0145860Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0146237Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0146460Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0146564Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0146660Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0146755Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0146758Z 2025-12-04T13:44:26.0146989Z [rank1]:[W1204 13:32:55.589694430 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0147172Z [rank3]:[W1204 13:32:56.967427314 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0147348Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0147643Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0147805Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0148177Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0148380Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0148484Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0148580Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0148676Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0148678Z 2025-12-04T13:44:26.0148911Z [rank3]:[W1204 13:32:56.969770282 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0149081Z [rank2]:[W1204 13:32:56.005897718 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0149257Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0149512Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0149676Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0150044Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0150277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0150400Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0150494Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0150590Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0150592Z 2025-12-04T13:44:26.0150823Z [rank2]:[W1204 13:32:56.007816956 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0150992Z [rank1]:[W1204 13:32:56.589881267 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0151184Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0151441Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0151603Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0151967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0152169Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0152273Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0152370Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0152467Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0152468Z 2025-12-04T13:44:26.0152699Z [rank1]:[W1204 13:32:56.591796905 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0152869Z [rank3]:[W1204 13:32:57.969946740 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0153043Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0153299Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0153460Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0153827Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0154028Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0154143Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0154259Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0154355Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0154357Z 2025-12-04T13:44:26.0154592Z [rank3]:[W1204 13:32:57.972238230 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0154760Z [rank2]:[W1204 13:32:57.007984154 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0154946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0155203Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0155367Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0155732Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0155932Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0156037Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0156133Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0156230Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0156232Z 2025-12-04T13:44:26.0156463Z [rank2]:[W1204 13:32:57.010556447 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0156634Z [rank1]:[W1204 13:32:57.591928224 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0156813Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0157070Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0157232Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0157634Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0157835Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0157939Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0158056Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0158179Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0158181Z 2025-12-04T13:44:26.0158418Z [rank1]:[W1204 13:32:57.593288224 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0158588Z [rank3]:[W1204 13:32:58.972443967 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0158762Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0159041Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0159205Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0159573Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0159782Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0159890Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0159986Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0160083Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0160086Z 2025-12-04T13:44:26.0160319Z [rank3]:[W1204 13:32:58.974920993 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0160492Z [rank2]:[W1204 13:32:58.010718675 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0160669Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0160925Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0161090Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0161461Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0161662Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0161767Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0161862Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0161974Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0161986Z 2025-12-04T13:44:26.0162228Z [rank2]:[W1204 13:32:58.013079014 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0162398Z [rank1]:[W1204 13:32:58.593419173 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0162573Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0162827Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0163006Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0163378Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0163578Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0163682Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0163780Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0163878Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0163880Z 2025-12-04T13:44:26.0164112Z [rank1]:[W1204 13:32:58.595163155 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0164284Z [rank3]:[W1204 13:32:59.975111230 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0164457Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0164712Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0164877Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0165250Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0165453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0165557Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0165653Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0165749Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0165751Z 2025-12-04T13:44:26.0166013Z [rank3]:[W1204 13:32:59.977614756 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0166192Z [rank2]:[W1204 13:32:59.013236232 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0166369Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0166627Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0166801Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0167169Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0167370Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0167505Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0167603Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0167704Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0167706Z 2025-12-04T13:44:26.0167942Z [rank2]:[W1204 13:32:59.015504862 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0168113Z [rank1]:[W1204 13:32:59.595334583 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0168287Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0168541Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0168704Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0169074Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0169278Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0169385Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0169484Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0169580Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0169583Z 2025-12-04T13:44:26.0169844Z [rank1]:[W1204 13:32:59.597377078 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0170043Z [rank3]:[W1204 13:33:00.977747255 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0170217Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0170474Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0170636Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0171023Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0171225Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0171328Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0171424Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0171518Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0171520Z 2025-12-04T13:44:26.0171754Z [rank3]:[W1204 13:33:00.980139992 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0171924Z [rank2]:[W1204 13:33:00.015672391 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0172105Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0172364Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0172526Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0172897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0173099Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0173204Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0173300Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0173399Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0173401Z 2025-12-04T13:44:26.0173638Z [rank2]:[W1204 13:33:00.017210947 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0173832Z [rank1]:[W1204 13:33:00.597565567 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0174019Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0174275Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0174438Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0174805Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0175025Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0175134Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0175228Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0175325Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0175327Z 2025-12-04T13:44:26.0175559Z [rank1]:[W1204 13:33:00.600135640 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0175730Z [rank3]:[W1204 13:33:01.980319801 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0175904Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0176160Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0176324Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0176702Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0176905Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0177009Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0177104Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0177200Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0177202Z 2025-12-04T13:44:26.0177436Z [rank3]:[W1204 13:33:01.982688069 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0177650Z [rank2]:[W1204 13:33:01.017369956 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0177854Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0178123Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0178284Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0178651Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0178873Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0178980Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0179077Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0179174Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0179176Z 2025-12-04T13:44:26.0179409Z [rank2]:[W1204 13:33:01.019970299 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0179579Z [rank1]:[W1204 13:33:01.600308139 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0179755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0180010Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0180175Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0180541Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0180744Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0180850Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0180947Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0181044Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0181046Z 2025-12-04T13:44:26.0181277Z [rank1]:[W1204 13:33:01.602259546 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0181448Z [rank3]:[W1204 13:33:02.982866428 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0181625Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0181911Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0182084Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0182451Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0182667Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0182771Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0182871Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0182969Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0182971Z 2025-12-04T13:44:26.0183210Z [rank3]:[W1204 13:33:02.985268685 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0183381Z [rank2]:[W1204 13:33:02.020134798 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0183557Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0183814Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0183978Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0184343Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0184543Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0184650Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0184746Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0184844Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0184846Z 2025-12-04T13:44:26.0185078Z [rank2]:[W1204 13:33:02.022412208 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0185248Z [rank1]:[W1204 13:33:02.602432455 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0185425Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0185702Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0185874Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0186239Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0186440Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0186557Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0186652Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0186753Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0186755Z 2025-12-04T13:44:26.0186990Z [rank1]:[W1204 13:33:02.604027260 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0187160Z [rank3]:[W1204 13:33:03.985611710 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0187333Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0187633Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0187798Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0188169Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0188371Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0188474Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0188569Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0188666Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0188669Z 2025-12-04T13:44:26.0188900Z [rank3]:[W1204 13:33:03.987803822 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0189070Z [rank2]:[W1204 13:33:03.022559938 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0189242Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0189498Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0189693Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0190075Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0190277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0190381Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0190489Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0190587Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0190590Z 2025-12-04T13:44:26.0190828Z [rank2]:[W1204 13:33:03.024901947 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0190998Z [rank1]:[W1204 13:33:03.604192670 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0191174Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0191427Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0191592Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0191965Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0192169Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0192273Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0192369Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0192464Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0192466Z 2025-12-04T13:44:26.0192700Z [rank1]:[W1204 13:33:03.606083768 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0192870Z [rank3]:[W1204 13:33:04.987950522 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0193043Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0193300Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0193463Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0193853Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0194065Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0194171Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0194267Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0194382Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0194384Z 2025-12-04T13:44:26.0194623Z [rank3]:[W1204 13:33:04.990015077 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0194796Z [rank2]:[W1204 13:33:04.025070636 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0194969Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0195224Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0195387Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0195757Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0195961Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0196066Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0196164Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0196260Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0196263Z 2025-12-04T13:44:26.0196497Z [rank2]:[W1204 13:33:04.027009243 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0196666Z [rank1]:[W1204 13:33:04.606553801 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0196841Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0197098Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0197262Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0197698Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0197911Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0198016Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0198111Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0198208Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0198225Z 2025-12-04T13:44:26.0198462Z [rank1]:[W1204 13:33:04.608903460 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0198636Z [rank3]:[W1204 13:33:05.990189177 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0198810Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0199064Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0199227Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0199593Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0199800Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0199905Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0200001Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0200096Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0200098Z 2025-12-04T13:44:26.0200333Z [rank3]:[W1204 13:33:05.992405318 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0200505Z [rank2]:[W1204 13:33:05.027181973 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0200680Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0200938Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0201103Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0201485Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0201707Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0201812Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0201908Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0202003Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0202005Z 2025-12-04T13:44:26.0202238Z [rank2]:[W1204 13:33:05.029928233 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0202421Z [rank1]:[W1204 13:33:05.609079019 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0202600Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0202858Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0203021Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0203387Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0203593Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0203699Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0203794Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0203890Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0203893Z 2025-12-04T13:44:26.0204126Z [rank1]:[W1204 13:33:05.611047666 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0204297Z [rank3]:[W1204 13:33:06.992575838 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0204471Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0204728Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0204892Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0205264Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0205488Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0205601Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0205696Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0205792Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0205794Z 2025-12-04T13:44:26.0206026Z [rank3]:[W1204 13:33:06.994887007 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0206210Z [rank2]:[W1204 13:33:06.030081813 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0206385Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0206641Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0206807Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0207177Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0207384Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0207515Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0207615Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0207711Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0207713Z 2025-12-04T13:44:26.0207946Z [rank2]:[W1204 13:33:06.032434461 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0208115Z [rank1]:[W1204 13:33:06.611225666 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0208290Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0208545Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0208710Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0209077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0209277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0209409Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0209526Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0209624Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0209626Z 2025-12-04T13:44:26.0209857Z [rank1]:[W1204 13:33:06.613282881 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0210030Z [rank3]:[W1204 13:33:07.995054727 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0210219Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0210476Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0210639Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0211005Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0211208Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0211313Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0211414Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0211511Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0211514Z 2025-12-04T13:44:26.0211747Z [rank3]:[W1204 13:33:07.997242719 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0211917Z [rank2]:[W1204 13:33:07.032571012 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0212091Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0212348Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0212510Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0212879Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0213079Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0213183Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0213301Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0213406Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0213408Z 2025-12-04T13:44:26.0213640Z [rank2]:[W1204 13:33:07.034114038 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0213810Z [rank1]:[W1204 13:33:07.613459521 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0213987Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0214261Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0214426Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0214792Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0214992Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0215097Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0215196Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0215298Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0215300Z 2025-12-04T13:44:26.0215532Z [rank1]:[W1204 13:33:07.615627163 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0215703Z [rank3]:[W1204 13:33:08.997399150 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0215878Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0216134Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0216299Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0216666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0216868Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0216973Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0217071Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0217187Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0217200Z 2025-12-04T13:44:26.0217433Z [rank3]:[W1204 13:33:08.998926926 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0217647Z [rank2]:[W1204 13:33:08.034225010 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0217825Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0218084Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0218268Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0218636Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0218838Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0218941Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0219041Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0219138Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0219141Z 2025-12-04T13:44:26.0219377Z [rank2]:[W1204 13:33:08.036499890 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0219548Z [rank1]:[W1204 13:33:08.615794034 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0219722Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0219977Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0220142Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0220515Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0220716Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0220823Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0220917Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0221016Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0221017Z 2025-12-04T13:44:26.0221277Z [rank1]:[W1204 13:33:08.617128975 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0221460Z [rank3]:[W1204 13:33:09.999122176 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0221638Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0221897Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0222074Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0222440Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0222645Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0222750Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0222848Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0222945Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0222948Z 2025-12-04T13:44:26.0223181Z [rank3]:[W1204 13:33:09.001088963 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0223353Z [rank2]:[W1204 13:33:09.036640751 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0223525Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0223778Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0223941Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0224313Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0224517Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0224621Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0224718Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0224814Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0224817Z 2025-12-04T13:44:26.0225064Z [rank2]:[W1204 13:33:09.038013781 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0225272Z [rank1]:[W1204 13:33:09.617304035 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0225451Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0225704Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0225866Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0226248Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0226450Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0226556Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0226653Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0226752Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0226755Z 2025-12-04T13:44:26.0226988Z [rank1]:[W1204 13:33:09.619556896 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0227161Z [rank3]:[W1204 13:33:10.001270744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0227336Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0227633Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0227796Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0228165Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0228367Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0228471Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0228566Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0228662Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0228664Z 2025-12-04T13:44:26.0228900Z [rank3]:[W1204 13:33:10.003411746 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0229101Z [rank2]:[W1204 13:33:10.038155643 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0229289Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0229545Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0229707Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0230073Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0230291Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0230394Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0230491Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0230589Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0230590Z 2025-12-04T13:44:26.0230823Z [rank2]:[W1204 13:33:10.040096280 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0230995Z [rank1]:[W1204 13:33:10.619722667 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0231172Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0231430Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0231592Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0231959Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0232161Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0232266Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0232362Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0232459Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0232461Z 2025-12-04T13:44:26.0232694Z [rank1]:[W1204 13:33:10.621853850 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0232865Z [rank3]:[W1204 13:33:11.003584928 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0233064Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0233341Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0233506Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0233874Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0234085Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0234193Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0234291Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0234387Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0234389Z 2025-12-04T13:44:26.0234623Z [rank3]:[W1204 13:33:11.005313270 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0234794Z [rank2]:[W1204 13:33:11.040207452 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0234968Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0235222Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0235383Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0235753Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0235958Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0236064Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0236163Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0236260Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0236262Z 2025-12-04T13:44:26.0236496Z [rank2]:[W1204 13:33:11.041534893 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0236667Z [rank1]:[W1204 13:33:11.622031301 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0236844Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0237126Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0237299Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0237700Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0237915Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0238022Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0238118Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0238214Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0238216Z 2025-12-04T13:44:26.0238452Z [rank1]:[W1204 13:33:11.624040597 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0238622Z [rank3]:[W1204 13:33:12.005476921 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0238798Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0239054Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0239218Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0239585Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0239789Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0239896Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0239993Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0240090Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0240092Z 2025-12-04T13:44:26.0240323Z [rank3]:[W1204 13:33:12.007935087 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0240493Z [rank2]:[W1204 13:33:12.041703895 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0240668Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0240967Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0241147Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0241513Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0241715Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0241829Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0241926Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0242023Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0242025Z 2025-12-04T13:44:26.0242263Z [rank2]:[W1204 13:33:12.043730580 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0242432Z [rank1]:[W1204 13:33:12.624225908 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0242607Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0242862Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0243026Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0243395Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0243597Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0243703Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0243798Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0243897Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0243900Z 2025-12-04T13:44:26.0244135Z [rank1]:[W1204 13:33:12.626467549 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0244305Z [rank3]:[W1204 13:33:13.008072709 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0244482Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0244739Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0244929Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0245306Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0245508Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0245622Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0245717Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0245814Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0245818Z 2025-12-04T13:44:26.0246053Z [rank3]:[W1204 13:33:13.010480976 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0246224Z [rank2]:[W1204 13:33:13.043881562 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0246400Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0246658Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0246823Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0247194Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0247397Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0247540Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0247638Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0247734Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0247737Z 2025-12-04T13:44:26.0247973Z [rank2]:[W1204 13:33:13.046205311 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0248142Z [rank1]:[W1204 13:33:13.626658300 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0248318Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0248574Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0248756Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0249153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0249354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0249459Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0249554Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0249667Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0249670Z 2025-12-04T13:44:26.0249906Z [rank1]:[W1204 13:33:13.627964631 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0250076Z [rank3]:[W1204 13:33:14.010655578 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0250251Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0250505Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0250670Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0251042Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0251247Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0251353Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0251449Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0251546Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0251548Z 2025-12-04T13:44:26.0251781Z [rank3]:[W1204 13:33:14.012812911 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0251952Z [rank2]:[W1204 13:33:14.046305094 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0252127Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0252384Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0252550Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0252936Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0253148Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0253253Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0253351Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0253448Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0253470Z 2025-12-04T13:44:26.0253703Z [rank2]:[W1204 13:33:14.047539947 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0253876Z [rank1]:[W1204 13:33:14.628132733 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0254052Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0254310Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0254472Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0254844Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0255045Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0255150Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0255245Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0255343Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0255346Z 2025-12-04T13:44:26.0255587Z [rank1]:[W1204 13:33:14.630390473 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0255759Z [rank3]:[W1204 13:33:15.012983163 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0255934Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0256188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0256350Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0256741Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0256959Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0257063Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0257157Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0257254Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0257256Z 2025-12-04T13:44:26.0257529Z [rank3]:[W1204 13:33:15.015308351 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0257720Z [rank2]:[W1204 13:33:15.047690670 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0257895Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0258152Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0258316Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0258685Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0258890Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0258994Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0259089Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0259188Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0259189Z 2025-12-04T13:44:26.0259426Z [rank2]:[W1204 13:33:15.048953402 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0259597Z [rank1]:[W1204 13:33:15.630818160 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0259772Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0260027Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0260190Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0260560Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0260790Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0260906Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0261002Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0261098Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0261100Z 2025-12-04T13:44:26.0261332Z [rank1]:[W1204 13:33:15.632727378 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0261513Z [rank3]:[W1204 13:33:16.015484613 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0261692Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0261951Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0262115Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0262480Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0262682Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0262787Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0262886Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0262983Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0262985Z 2025-12-04T13:44:26.0263218Z [rank3]:[W1204 13:33:16.016698857 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0263388Z [rank2]:[W1204 13:33:16.049126314 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0263563Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0263820Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0263983Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0264355Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0264569Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0264693Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0264789Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0264885Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0264887Z 2025-12-04T13:44:26.0265121Z [rank2]:[W1204 13:33:16.050993983 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0265291Z [rank1]:[W1204 13:33:16.632891560 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0265475Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0265733Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0265896Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0266263Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0266467Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0266577Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0266676Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0266772Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0266774Z 2025-12-04T13:44:26.0267007Z [rank1]:[W1204 13:33:16.634881557 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0267175Z [rank3]:[W1204 13:33:17.017095464 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0267352Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0267644Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0267812Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0268181Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0268381Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0268504Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0268629Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0268726Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0268728Z 2025-12-04T13:44:26.0268960Z [rank3]:[W1204 13:33:17.019143969 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0269136Z [rank2]:[W1204 13:33:17.051112036 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0269311Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0269584Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0269749Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0270113Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0270317Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0270423Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0270522Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0270619Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0270621Z 2025-12-04T13:44:26.0270855Z [rank2]:[W1204 13:33:17.052872958 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0271025Z [rank1]:[W1204 13:33:17.635056279 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0271197Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0271454Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0271621Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0271988Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0272191Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0272296Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0272392Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0272512Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0272524Z 2025-12-04T13:44:26.0272758Z [rank1]:[W1204 13:33:17.637320200 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0272930Z [rank3]:[W1204 13:33:18.019273353 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0273110Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0273380Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0273544Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0273910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0274111Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0274216Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0274311Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0274408Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0274412Z 2025-12-04T13:44:26.0274643Z [rank3]:[W1204 13:33:18.020521985 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0274813Z [rank2]:[W1204 13:33:18.053055430 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0274985Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0275244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0275412Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0275779Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0275981Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0276085Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0276182Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0276278Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0276294Z 2025-12-04T13:44:26.0276551Z [rank2]:[W1204 13:33:18.055035017 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0276721Z [rank1]:[W1204 13:33:18.637525752 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0276894Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0277148Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0277323Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0277730Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0277934Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0278037Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0278133Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0278228Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0278230Z 2025-12-04T13:44:26.0278463Z [rank1]:[W1204 13:33:18.639696314 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0278633Z [rank3]:[W1204 13:33:19.020663899 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0278808Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0279065Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0279234Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0279604Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0279807Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0279913Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0280007Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0280104Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0280106Z 2025-12-04T13:44:26.0280371Z [rank3]:[W1204 13:33:19.022014229 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0280557Z [rank2]:[W1204 13:33:19.055202440 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0280731Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0280984Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0281160Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0281526Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0281734Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0281840Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0281937Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0282033Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0282037Z 2025-12-04T13:44:26.0282270Z [rank2]:[W1204 13:33:19.057473450 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0282441Z [rank1]:[W1204 13:33:19.639867097 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0282615Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0282873Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0283037Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0283406Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0283607Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0283709Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0283805Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0283900Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0283902Z 2025-12-04T13:44:26.0284135Z [rank1]:[W1204 13:33:19.641902582 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0284335Z [rank3]:[W1204 13:33:20.022156983 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0284523Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0284780Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0284943Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0285326Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0285528Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0285634Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0285730Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0285828Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0285829Z 2025-12-04T13:44:26.0286065Z [rank3]:[W1204 13:33:20.023392276 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0286239Z [rank2]:[W1204 13:33:20.057643093 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0286417Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0286673Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0286838Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0287209Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0287414Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0287552Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0287649Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0287744Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0287747Z 2025-12-04T13:44:26.0287979Z [rank2]:[W1204 13:33:20.059509152 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0288172Z [rank1]:[W1204 13:33:20.642092955 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0288377Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0288636Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0288798Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0289162Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0289381Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0289490Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0289585Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0289680Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0289682Z 2025-12-04T13:44:26.0289915Z [rank1]:[W1204 13:33:20.643669230 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0290084Z [rank3]:[W1204 13:33:21.023565439 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0290261Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0290519Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0290688Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0291054Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0291257Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0291363Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0291458Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0291554Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0291555Z 2025-12-04T13:44:26.0291790Z [rank3]:[W1204 13:33:21.024949859 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0291965Z [rank2]:[W1204 13:33:21.059701295 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0292154Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0292428Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0292591Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0292963Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0293181Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0293288Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0293382Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0293479Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0293481Z 2025-12-04T13:44:26.0293712Z [rank2]:[W1204 13:33:21.060981536 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0293881Z [rank1]:[W1204 13:33:21.643830474 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0294056Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0294313Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0294474Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0294842Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0295045Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0295151Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0295249Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0295345Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0295348Z 2025-12-04T13:44:26.0295579Z [rank1]:[W1204 13:33:21.645096006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0295750Z [rank3]:[W1204 13:33:22.025142941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0295925Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0296202Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0296377Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0296744Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0296944Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0297063Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0297160Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0297256Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0297258Z 2025-12-04T13:44:26.0297543Z [rank3]:[W1204 13:33:22.026866244 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0297715Z [rank2]:[W1204 13:33:22.061127371 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0297891Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0298146Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0298310Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0298675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0298881Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0298991Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0299087Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0299186Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0299188Z 2025-12-04T13:44:26.0299421Z [rank2]:[W1204 13:33:22.062338574 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0299594Z [rank1]:[W1204 13:33:22.645248600 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0299768Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0300048Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0300240Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0300609Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0300810Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0300928Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0301024Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0301120Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0301123Z 2025-12-04T13:44:26.0301358Z [rank1]:[W1204 13:33:22.646513472 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0301528Z [rank3]:[W1204 13:33:23.027045877 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0301703Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0301962Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0302125Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0302494Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0302697Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0302801Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0302896Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0302994Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0302996Z 2025-12-04T13:44:26.0303231Z [rank3]:[W1204 13:33:23.029028604 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0303399Z [rank2]:[W1204 13:33:23.062517358 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0303575Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0303836Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0304013Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0304401Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0304603Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0304709Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0304813Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0304914Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0304915Z 2025-12-04T13:44:26.0305150Z [rank2]:[W1204 13:33:23.064468645 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0305322Z [rank1]:[W1204 13:33:23.646642367 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0305495Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0305750Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0305914Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0306287Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0306488Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0306591Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0306689Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0306786Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0306788Z 2025-12-04T13:44:26.0307021Z [rank1]:[W1204 13:33:23.648972266 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0307192Z [rank3]:[W1204 13:33:24.029176808 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0307367Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0307672Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0307834Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0308233Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0308445Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0308553Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0308648Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0308761Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0308763Z 2025-12-04T13:44:26.0308999Z [rank3]:[W1204 13:33:24.031229603 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0309168Z [rank2]:[W1204 13:33:24.064610649 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0309342Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0309595Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0309757Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0310124Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0310327Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0310431Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0310526Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0310622Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0310627Z 2025-12-04T13:44:26.0310858Z [rank2]:[W1204 13:33:24.065837232 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0311029Z [rank1]:[W1204 13:33:24.649110291 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0311204Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0311462Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0311625Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0312013Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0312226Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0312330Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0312428Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0312528Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0312530Z 2025-12-04T13:44:26.0312782Z [rank1]:[W1204 13:33:24.651386911 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0312952Z [rank3]:[W1204 13:33:25.031361348 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0313128Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0313384Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0313546Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0313917Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0314117Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0314221Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0314315Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0314411Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0314413Z 2025-12-04T13:44:26.0314649Z [rank3]:[W1204 13:33:25.032589001 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0314821Z [rank2]:[W1204 13:33:25.065981267 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0314997Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0315250Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0315413Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0315789Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0316012Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0316120Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0316217Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0316316Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0316317Z 2025-12-04T13:44:26.0316548Z [rank2]:[W1204 13:33:25.067946344 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0316735Z [rank1]:[W1204 13:33:25.651527056 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0316910Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0317165Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0317332Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0317738Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0317941Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0318044Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0318141Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0318236Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0318238Z 2025-12-04T13:44:26.0318469Z [rank1]:[W1204 13:33:25.652960484 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0318644Z [rank3]:[W1204 13:33:26.032786975 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0318822Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0319081Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0319244Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0319608Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0319843Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0319964Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0320061Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0320156Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0320158Z 2025-12-04T13:44:26.0320392Z [rank3]:[W1204 13:33:26.034573096 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0320560Z [rank2]:[W1204 13:33:26.068108629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0320754Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0321009Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0321176Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0321546Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0321748Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0321854Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0321949Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0322046Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0322048Z 2025-12-04T13:44:26.0322282Z [rank2]:[W1204 13:33:26.070042856 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0322454Z [rank1]:[W1204 13:33:26.653067200 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0322628Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0322883Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0323046Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0323415Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0323620Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0323736Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0323853Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0323949Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0323950Z 2025-12-04T13:44:26.0324183Z [rank1]:[W1204 13:33:26.654413211 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0324353Z [rank3]:[W1204 13:33:27.034711761 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0324541Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0324802Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0324963Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0325328Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0325528Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0325634Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0325736Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0325835Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0325838Z 2025-12-04T13:44:26.0326074Z [rank3]:[W1204 13:33:27.036104000 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0326242Z [rank2]:[W1204 13:33:27.070184791 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0326417Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0326672Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0326836Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0327204Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0327404Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0327551Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0327664Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0327793Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0327795Z 2025-12-04T13:44:26.0328029Z [rank2]:[W1204 13:33:27.072242536 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0328205Z [rank1]:[W1204 13:33:27.654557946 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0328380Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0328655Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0328821Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0329185Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0329389Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0329495Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0329592Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0329689Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0329692Z 2025-12-04T13:44:26.0329926Z [rank1]:[W1204 13:33:27.656707209 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0330097Z [rank3]:[W1204 13:33:28.036278525 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0330270Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0330526Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0330693Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0331065Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0331264Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0331368Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0331464Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0331573Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0331584Z 2025-12-04T13:44:26.0331830Z [rank3]:[W1204 13:33:28.037527508 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0332003Z [rank2]:[W1204 13:33:28.072385281 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0332178Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0332433Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0332609Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0332975Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0333178Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0333283Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0333378Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0333474Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0333476Z 2025-12-04T13:44:26.0333710Z [rank2]:[W1204 13:33:28.074462266 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0333880Z [rank1]:[W1204 13:33:28.656868974 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0334053Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0334311Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0334478Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0334843Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0335048Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0335151Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0335247Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0335343Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0335345Z 2025-12-04T13:44:26.0335604Z [rank1]:[W1204 13:33:28.658143656 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0335788Z [rank3]:[W1204 13:33:29.037690193 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0335961Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0336217Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0336391Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0336765Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0336967Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0337070Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0337169Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0337266Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0337268Z 2025-12-04T13:44:26.0337549Z [rank3]:[W1204 13:33:29.038923106 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0337720Z [rank2]:[W1204 13:33:29.074884975 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0337895Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0338153Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0338317Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0338686Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0338888Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0338994Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0339089Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0339191Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0339194Z 2025-12-04T13:44:26.0339444Z [rank2]:[W1204 13:33:29.076950190 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0339647Z [rank1]:[W1204 13:33:29.658273482 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0339824Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0340078Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0340241Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0340621Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0340824Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0340927Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0341024Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0341119Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0341123Z 2025-12-04T13:44:26.0341358Z [rank1]:[W1204 13:33:29.659509405 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0341533Z [rank3]:[W1204 13:33:30.039085151 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0341709Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0341963Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0342125Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0342492Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0342698Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0342805Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0342901Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0342996Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0342998Z 2025-12-04T13:44:26.0343236Z [rank3]:[W1204 13:33:30.040625147 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0343438Z [rank2]:[W1204 13:33:30.077105846 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0343624Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0343876Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0344041Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0344410Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0344627Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0344733Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0344828Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0344926Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0344928Z 2025-12-04T13:44:26.0345160Z [rank2]:[W1204 13:33:30.079168590 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0345335Z [rank1]:[W1204 13:33:30.659651371 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0345513Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0345766Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0345929Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0346292Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0346498Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0346605Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0346702Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0346797Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0346800Z 2025-12-04T13:44:26.0347033Z [rank1]:[W1204 13:33:30.660937913 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0347205Z [rank3]:[W1204 13:33:31.041109536 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0347402Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0347712Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0347876Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0348245Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0348462Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0348568Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0348664Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0348760Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0348762Z 2025-12-04T13:44:26.0348997Z [rank3]:[W1204 13:33:31.042466026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0349169Z [rank2]:[W1204 13:33:31.079321946 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0349347Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0349602Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0349765Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0350134Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0350335Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0350445Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0350544Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0350641Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0350643Z 2025-12-04T13:44:26.0350879Z [rank2]:[W1204 13:33:31.080558079 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0351054Z [rank1]:[W1204 13:33:31.661089189 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0351230Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0351512Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0351697Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0352064Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0352280Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0352385Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0352482Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0352580Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0352582Z 2025-12-04T13:44:26.0352814Z [rank1]:[W1204 13:33:31.662366580 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0352988Z [rank3]:[W1204 13:33:32.042679581 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0353165Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0353423Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0353584Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0353950Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0354154Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0354258Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0354355Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0354451Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0354452Z 2025-12-04T13:44:26.0354686Z [rank3]:[W1204 13:33:32.044930332 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0354855Z [rank2]:[W1204 13:33:32.080718575 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0355028Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0361068Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0361263Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0361636Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0361838Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0361969Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0362068Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0362167Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0362169Z 2025-12-04T13:44:26.0362406Z [rank2]:[W1204 13:33:32.082619903 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0362578Z [rank1]:[W1204 13:33:32.662515757 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0362755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0363014Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0363182Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0363554Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0363758Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0363863Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0363963Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0364061Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0364065Z 2025-12-04T13:44:26.0364297Z [rank1]:[W1204 13:33:32.663801719 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0364467Z [rank3]:[W1204 13:33:33.045335062 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0364644Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0364904Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0365099Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0365481Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0365683Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0365790Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0365899Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0365995Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0365998Z 2025-12-04T13:44:26.0366233Z [rank3]:[W1204 13:33:33.047346678 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0366404Z [rank2]:[W1204 13:33:33.082739020 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0366578Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0366833Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0367002Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0367373Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0367620Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0367725Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0367822Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0367919Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0367921Z 2025-12-04T13:44:26.0368154Z [rank2]:[W1204 13:33:33.084889443 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0368327Z [rank1]:[W1204 13:33:33.663983895 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0368502Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0368754Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0368918Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0369311Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0369530Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0369635Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0369732Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0369848Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0369850Z 2025-12-04T13:44:26.0370086Z [rank1]:[W1204 13:33:33.665547860 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0370258Z [rank3]:[W1204 13:33:34.047479115 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0370432Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0370692Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0370856Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0371223Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0371425Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0371529Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0371626Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0371721Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0371724Z 2025-12-04T13:44:26.0371963Z [rank3]:[W1204 13:33:34.049821194 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0372135Z [rank2]:[W1204 13:33:34.085031060 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0372309Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0372564Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0372727Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0373123Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0373337Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0373441Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0373538Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0373637Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0373652Z 2025-12-04T13:44:26.0373886Z [rank2]:[W1204 13:33:34.086277002 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0374056Z [rank1]:[W1204 13:33:34.665713837 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0374231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0374494Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0374660Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0375026Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0375228Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0375332Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0375427Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0375525Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0375527Z 2025-12-04T13:44:26.0375760Z [rank1]:[W1204 13:33:34.667409689 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0375931Z [rank3]:[W1204 13:33:35.049985880 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0376106Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0376364Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0376529Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0376912Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0377134Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0377238Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0377334Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0377429Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0377430Z 2025-12-04T13:44:26.0377705Z [rank3]:[W1204 13:33:35.051818590 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0377894Z [rank2]:[W1204 13:33:35.086442739 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0378072Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0378328Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0378492Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0378863Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0379068Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0379175Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0379272Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0379373Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0379375Z 2025-12-04T13:44:26.0379611Z [rank2]:[W1204 13:33:35.088534303 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0379781Z [rank1]:[W1204 13:33:35.667539407 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0379958Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0380212Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0380375Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0380741Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0380971Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0381089Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0381184Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0381280Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0381282Z 2025-12-04T13:44:26.0381513Z [rank1]:[W1204 13:33:35.669025104 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0381697Z [rank3]:[W1204 13:33:36.052015996 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0381875Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0382131Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0382293Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0382662Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0382864Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0382971Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0383069Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0383164Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0383166Z 2025-12-04T13:44:26.0383400Z [rank3]:[W1204 13:33:36.053297908 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0383570Z [rank2]:[W1204 13:33:36.088684210 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0383746Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0384003Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0384168Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0384537Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0384740Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0384872Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0384979Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0385076Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0385078Z 2025-12-04T13:44:26.0385318Z [rank2]:[W1204 13:33:36.090560799 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0385490Z [rank1]:[W1204 13:33:36.669181971 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0385678Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0385934Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0386098Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0386463Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0386667Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0386773Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0386868Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0386965Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0386966Z 2025-12-04T13:44:26.0387199Z [rank1]:[W1204 13:33:36.670714507 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0387371Z [rank3]:[W1204 13:33:37.053441185 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0387593Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0387864Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0388030Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0388400Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0388601Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0388706Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0388832Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0388939Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0388941Z 2025-12-04T13:44:26.0389176Z [rank3]:[W1204 13:33:37.055736565 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0389346Z [rank2]:[W1204 13:33:37.090702706 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0389521Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0389796Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0389961Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0390333Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0390535Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0390641Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0390737Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0390835Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0390837Z 2025-12-04T13:44:26.0391069Z [rank2]:[W1204 13:33:37.091937179 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0391238Z [rank1]:[W1204 13:33:37.670868445 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0391412Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0391667Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0391833Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0392206Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0392406Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0392513Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0392608Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0392726Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0392745Z 2025-12-04T13:44:26.0392980Z [rank1]:[W1204 13:33:37.672395891 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0393151Z [rank3]:[W1204 13:33:38.055903352 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0393327Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0393583Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0393761Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0394131Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0394335Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0394438Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0394536Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0394631Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0394634Z 2025-12-04T13:44:26.0394869Z [rank3]:[W1204 13:33:38.057186524 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0395039Z [rank2]:[W1204 13:33:38.092098936 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0395213Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0395471Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0395635Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0396006Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0396210Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0396353Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0396449Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0396549Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0396551Z 2025-12-04T13:44:26.0396808Z [rank2]:[W1204 13:33:38.093955465 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0396987Z [rank1]:[W1204 13:33:38.672544759 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0397162Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0397414Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0397626Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0397993Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0398195Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0398300Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0398395Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0398492Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0398494Z 2025-12-04T13:44:26.0398729Z [rank1]:[W1204 13:33:38.673805711 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0398900Z [rank3]:[W1204 13:33:39.057358861 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0399073Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0399331Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0399497Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0399865Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0400067Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0400170Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0400266Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0400362Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0400365Z 2025-12-04T13:44:26.0400630Z [rank3]:[W1204 13:33:39.058594874 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0400814Z [rank2]:[W1204 13:33:39.094085044 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0400987Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0401242Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0401404Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0401791Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0401993Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0402098Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0402193Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0402290Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0402293Z 2025-12-04T13:44:26.0402526Z [rank2]:[W1204 13:33:39.095363776 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0402697Z [rank1]:[W1204 13:33:39.673970958 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0402878Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0403140Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0403302Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0403671Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0403874Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0403981Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0404077Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0404176Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0404178Z 2025-12-04T13:44:26.0404413Z [rank1]:[W1204 13:33:39.675241161 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0404605Z [rank3]:[W1204 13:33:40.058779721 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0404789Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0405046Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0405210Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0405580Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0405795Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0405898Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0405994Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0406089Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0406090Z 2025-12-04T13:44:26.0406324Z [rank3]:[W1204 13:33:40.060920284 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0406496Z [rank2]:[W1204 13:33:40.095512884 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0406674Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0406930Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0407091Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0407461Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0407699Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0407808Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0407906Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0408003Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0408005Z 2025-12-04T13:44:26.0408240Z [rank2]:[W1204 13:33:40.096746416 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0408412Z [rank1]:[W1204 13:33:40.675409288 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0408621Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0408886Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0409048Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0409417Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0409636Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0409742Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0409837Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0409933Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0409935Z 2025-12-04T13:44:26.0410167Z [rank1]:[W1204 13:33:40.676917485 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0410345Z [rank3]:[W1204 13:33:41.061105281 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0410522Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0410777Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0410939Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0411302Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0411505Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0411613Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0411712Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0411808Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0411810Z 2025-12-04T13:44:26.0412042Z [rank3]:[W1204 13:33:41.062349344 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0412214Z [rank2]:[W1204 13:33:41.096931874 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0412388Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0412668Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0412842Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0413210Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0413425Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0413531Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0413627Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0413724Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0413726Z 2025-12-04T13:44:26.0413965Z [rank2]:[W1204 13:33:41.098494689 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0414136Z [rank1]:[W1204 13:33:41.677325618 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0414314Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0414569Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0414734Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0415104Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0415308Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0415412Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0415508Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0415605Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0415606Z 2025-12-04T13:44:26.0415838Z [rank1]:[W1204 13:33:41.678815055 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0416008Z [rank3]:[W1204 13:33:42.062533931 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0416184Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0416468Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0416647Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0417012Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0417213Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0417328Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0417424Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0417560Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0417562Z 2025-12-04T13:44:26.0417796Z [rank3]:[W1204 13:33:42.064632465 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0417966Z [rank2]:[W1204 13:33:42.098649008 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0418140Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0418402Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0418566Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0418937Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0419138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0419243Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0419339Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0419436Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0419438Z 2025-12-04T13:44:26.0419670Z [rank2]:[W1204 13:33:42.100386959 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0419840Z [rank1]:[W1204 13:33:42.678972533 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0420018Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0420275Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0420474Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0420852Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0421051Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0421171Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0421269Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0421370Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0421374Z 2025-12-04T13:44:26.0421608Z [rank1]:[W1204 13:33:42.680782713 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0421779Z [rank3]:[W1204 13:33:43.064811313 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0421954Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0422209Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0422375Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0422743Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0422945Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0423048Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0423145Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0423240Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0423244Z 2025-12-04T13:44:26.0423478Z [rank3]:[W1204 13:33:43.066917297 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0423648Z [rank2]:[W1204 13:33:43.100532248 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0423823Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0424078Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0424259Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0424649Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0424851Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0424957Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0425057Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0425168Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0425170Z 2025-12-04T13:44:26.0425405Z [rank2]:[W1204 13:33:43.102300619 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0425575Z [rank1]:[W1204 13:33:43.680915293 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0425749Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0426002Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0426170Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0426540Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0426746Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0426851Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0426945Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0427042Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0427044Z 2025-12-04T13:44:26.0427277Z [rank1]:[W1204 13:33:43.682153365 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0427449Z [rank3]:[W1204 13:33:44.067075235 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0427667Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0427921Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0428085Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0428485Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0428702Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0428810Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0428908Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0429004Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0429023Z 2025-12-04T13:44:26.0429257Z [rank3]:[W1204 13:33:44.069061282 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0429429Z [rank2]:[W1204 13:33:44.102442368 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0429602Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0429857Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0430021Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0430393Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0430596Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0430699Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0430795Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0430891Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0430894Z 2025-12-04T13:44:26.0431136Z [rank2]:[W1204 13:33:44.103664361 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0431309Z [rank1]:[W1204 13:33:44.682317294 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0431488Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0431742Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0431905Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0432292Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0432504Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0432610Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0432704Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0432800Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0432802Z 2025-12-04T13:44:26.0433035Z [rank1]:[W1204 13:33:44.684391448 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0433223Z [rank3]:[W1204 13:33:45.069220021 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0433399Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0433656Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0433821Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0434188Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0434392Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0434496Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0434592Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0434688Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0434690Z 2025-12-04T13:44:26.0434920Z [rank3]:[W1204 13:33:45.071536200 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0435095Z [rank2]:[W1204 13:33:45.103811250 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0435271Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0435528Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0435690Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0436056Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0436285Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0436399Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0436497Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0436593Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0436595Z 2025-12-04T13:44:26.0436828Z [rank2]:[W1204 13:33:45.105971983 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0437014Z [rank1]:[W1204 13:33:45.684550707 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0437194Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0437450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0437666Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0438031Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0438234Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0438339Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0438434Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0438530Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0438532Z 2025-12-04T13:44:26.0438765Z [rank1]:[W1204 13:33:45.686186791 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0438937Z [rank3]:[W1204 13:33:46.071719478 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0439112Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0439371Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0439535Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0439903Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0440124Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0440254Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0440350Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0440446Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0440448Z 2025-12-04T13:44:26.0440680Z [rank3]:[W1204 13:33:46.073650786 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0440850Z [rank2]:[W1204 13:33:46.106147362 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0441038Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0441297Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0441460Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0441830Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0442034Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0442140Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0442239Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0442336Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0442338Z 2025-12-04T13:44:26.0442572Z [rank2]:[W1204 13:33:46.107988601 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0442740Z [rank1]:[W1204 13:33:46.686356341 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0442916Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0443170Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0443333Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0443704Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0443905Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0444024Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0444139Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0444235Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0444237Z 2025-12-04T13:44:26.0444469Z [rank1]:[W1204 13:33:46.688712209 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0444640Z [rank3]:[W1204 13:33:47.073822626 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0444829Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0445087Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0445252Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0445620Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0445824Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0445931Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0446027Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0446126Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0446127Z 2025-12-04T13:44:26.0446362Z [rank3]:[W1204 13:33:47.075827051 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0446533Z [rank2]:[W1204 13:33:47.108173181 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0446705Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0446963Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0447126Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0447530Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0447731Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0447835Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0447931Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0448052Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0448067Z 2025-12-04T13:44:26.0448303Z [rank2]:[W1204 13:33:47.110213336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0448473Z [rank1]:[W1204 13:33:47.688872329 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0448649Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0448924Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0449087Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0449455Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0449656Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0449761Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0449858Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0449957Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0449960Z 2025-12-04T13:44:26.0450195Z [rank1]:[W1204 13:33:47.691007522 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0450365Z [rank3]:[W1204 13:33:48.075988441 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0450541Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0450796Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0450962Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0451330Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0451532Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0451637Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0451732Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0451829Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0451846Z 2025-12-04T13:44:26.0452115Z [rank3]:[W1204 13:33:48.078142673 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0452292Z [rank2]:[W1204 13:33:48.110343226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0452468Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0452726Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0452905Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0453274Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0453476Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0453581Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0453679Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0453775Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0453778Z 2025-12-04T13:44:26.0454012Z [rank2]:[W1204 13:33:48.112773492 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0454182Z [rank1]:[W1204 13:33:48.691160251 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0454356Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0454612Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0454775Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0455142Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0455343Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0455448Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0455542Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0455641Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0455643Z 2025-12-04T13:44:26.0455896Z [rank1]:[W1204 13:33:48.692807995 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0456078Z [rank3]:[W1204 13:33:49.078322103 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0456255Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0456510Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0456690Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0457060Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0457263Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0457371Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0457465Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0457610Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0457613Z 2025-12-04T13:44:26.0457847Z [rank3]:[W1204 13:33:49.080744249 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0458019Z [rank2]:[W1204 13:33:49.112913912 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0458191Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0458447Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0458614Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0458985Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0459187Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0459290Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0459388Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0459485Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0459487Z 2025-12-04T13:44:26.0459721Z [rank2]:[W1204 13:33:49.114138985 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0459928Z [rank1]:[W1204 13:33:49.692973285 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0460120Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0460374Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0460535Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0460923Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0461126Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0461235Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0461329Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0461428Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0461430Z 2025-12-04T13:44:26.0461664Z [rank1]:[W1204 13:33:49.694516561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0461835Z [rank3]:[W1204 13:33:50.080924439 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0462009Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0462262Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0462426Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0462791Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0462995Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0463101Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0463195Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0463294Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0463296Z 2025-12-04T13:44:26.0463530Z [rank3]:[W1204 13:33:50.082995273 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0463715Z [rank2]:[W1204 13:33:50.114308745 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0463908Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0464162Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0464326Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0464693Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0464908Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0465012Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0465109Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0465205Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0465207Z 2025-12-04T13:44:26.0465440Z [rank2]:[W1204 13:33:50.116488647 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0465610Z [rank1]:[W1204 13:33:50.694644341 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0465786Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0466043Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0466207Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0466571Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0466773Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0466879Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0466975Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0467071Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0467073Z 2025-12-04T13:44:26.0467312Z [rank1]:[W1204 13:33:50.696924441 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0467518Z [rank3]:[W1204 13:33:51.083132394 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0467712Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0467998Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0468160Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0468527Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0468753Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0468860Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0468954Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0469050Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0469053Z 2025-12-04T13:44:26.0469285Z [rank3]:[W1204 13:33:51.085087921 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0469456Z [rank2]:[W1204 13:33:51.116619808 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0469633Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0469892Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0470056Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0470420Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0470622Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0470726Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0470828Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0470925Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0470926Z 2025-12-04T13:44:26.0471159Z [rank2]:[W1204 13:33:51.117844521 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0471329Z [rank1]:[W1204 13:33:51.697066292 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0471503Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0471785Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0471961Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0472326Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0472526Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0472641Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0472738Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0472835Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0472837Z 2025-12-04T13:44:26.0473073Z [rank1]:[W1204 13:33:51.699449269 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0473241Z [rank3]:[W1204 13:33:52.085267641 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0473416Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0473672Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0473837Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0474210Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0474410Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0474517Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0474612Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0474710Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0474712Z 2025-12-04T13:44:26.0474944Z [rank3]:[W1204 13:33:52.087230067 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0475113Z [rank2]:[W1204 13:33:52.118005991 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0475290Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0475557Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0475743Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0476110Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0476314Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0476430Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0476526Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0476623Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0476626Z 2025-12-04T13:44:26.0476857Z [rank2]:[W1204 13:33:52.120304101 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0477026Z [rank1]:[W1204 13:33:52.699605250 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0477199Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0477458Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0477655Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0478022Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0478224Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0478334Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0478433Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0478531Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0478534Z 2025-12-04T13:44:26.0478767Z [rank1]:[W1204 13:33:52.701885910 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0478938Z [rank3]:[W1204 13:33:53.087405678 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0479115Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0479371Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0479556Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0479947Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0480150Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0480255Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0480366Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0480462Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0480464Z 2025-12-04T13:44:26.0480702Z [rank3]:[W1204 13:33:53.089268157 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0480873Z [rank2]:[W1204 13:33:53.120453851 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0481048Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0481300Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0481464Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0481832Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0482034Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0482139Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0482234Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0482331Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0482334Z 2025-12-04T13:44:26.0482568Z [rank2]:[W1204 13:33:53.121763642 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0482740Z [rank1]:[W1204 13:33:53.701996981 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0482915Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0483168Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0483331Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0483729Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0483939Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0484042Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0484137Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0484242Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0484245Z 2025-12-04T13:44:26.0484478Z [rank1]:[W1204 13:33:53.704370199 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0484650Z [rank3]:[W1204 13:33:54.089675172 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0484827Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0485084Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0485249Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0485617Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0485820Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0485929Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0486023Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0486119Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0486122Z 2025-12-04T13:44:26.0486354Z [rank3]:[W1204 13:33:54.091481292 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0486526Z [rank2]:[W1204 13:33:54.121907864 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0486700Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0486954Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0487120Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0487548Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0487764Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0487868Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0487966Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0488062Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0488078Z 2025-12-04T13:44:26.0488310Z [rank2]:[W1204 13:33:54.124249352 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0488483Z [rank1]:[W1204 13:33:54.704469971 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0488659Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0488916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0489079Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0489449Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0489651Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0489754Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0489849Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0489944Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0489946Z 2025-12-04T13:44:26.0490179Z [rank1]:[W1204 13:33:54.706643174 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0490350Z [rank3]:[W1204 13:33:55.091667742 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0490524Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0490777Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0490939Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0491323Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0491556Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0491661Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0491756Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0491851Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0491853Z 2025-12-04T13:44:26.0492086Z [rank3]:[W1204 13:33:55.093381305 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0492270Z [rank2]:[W1204 13:33:55.124357854 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0492447Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0492700Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0492864Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0493234Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0493443Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0493547Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0493642Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0493739Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0493741Z 2025-12-04T13:44:26.0493975Z [rank2]:[W1204 13:33:55.125619856 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0494145Z [rank1]:[W1204 13:33:55.707061969 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0494319Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0494577Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0494738Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0495103Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0495326Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0495440Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0495536Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0495633Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0495635Z 2025-12-04T13:44:26.0495871Z [rank1]:[W1204 13:33:55.709093794 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0496041Z [rank3]:[W1204 13:33:56.093561885 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0496235Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0496490Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0496655Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0497026Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0497228Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0497334Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0497429Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0497561Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0497563Z 2025-12-04T13:44:26.0497796Z [rank3]:[W1204 13:33:56.094802778 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0497972Z [rank2]:[W1204 13:33:56.125758328 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0498149Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0498403Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0498569Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0498935Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0499138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0499255Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0499377Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0499475Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0499477Z 2025-12-04T13:44:26.0499707Z [rank2]:[W1204 13:33:56.127273265 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0499876Z [rank1]:[W1204 13:33:56.709230026 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0500065Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0500324Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0500487Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0500853Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0501059Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0501165Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0501262Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0501358Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0501360Z 2025-12-04T13:44:26.0501593Z [rank1]:[W1204 13:33:56.710853540 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0501762Z [rank3]:[W1204 13:33:57.094971699 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0501937Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0502204Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0502367Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0502734Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0502935Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0503040Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0503148Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0503267Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0503270Z 2025-12-04T13:44:26.0503501Z [rank3]:[W1204 13:33:57.097087403 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0503673Z [rank2]:[W1204 13:33:57.127414466 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0503847Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0504111Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0504280Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0504651Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0504853Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0504960Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0505055Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0505153Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0505156Z 2025-12-04T13:44:26.0505386Z [rank2]:[W1204 13:33:57.128914883 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0505557Z [rank1]:[W1204 13:33:57.711035741 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0505730Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0505987Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0506150Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0506518Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0506723Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0506828Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0506924Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0507031Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0507043Z 2025-12-04T13:44:26.0507286Z [rank1]:[W1204 13:33:57.713316661 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0507456Z [rank3]:[W1204 13:33:58.097258184 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0507672Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0507931Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0508110Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0508476Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0508678Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0508782Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0508877Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0508974Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0508976Z 2025-12-04T13:44:26.0509213Z [rank3]:[W1204 13:33:58.099570263 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0509383Z [rank2]:[W1204 13:33:58.129049895 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0509559Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0509815Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0509979Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0510345Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0510549Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0510654Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0510750Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0510848Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0510850Z 2025-12-04T13:44:26.0511110Z [rank2]:[W1204 13:33:58.130249249 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0511303Z [rank1]:[W1204 13:33:58.713495552 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0511477Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0511733Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0511906Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0512277Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0512481Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0512584Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0512679Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0512776Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0512778Z 2025-12-04T13:44:26.0513015Z [rank1]:[W1204 13:33:58.715683754 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0513186Z [rank3]:[W1204 13:33:59.099764164 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0513365Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0513623Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0513786Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0514158Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0514358Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0514463Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0514557Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0514654Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0514657Z 2025-12-04T13:44:26.0514902Z [rank3]:[W1204 13:33:59.101993745 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0515093Z [rank2]:[W1204 13:33:59.130380411 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0515269Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0515525Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0515690Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0516072Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0516280Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0516387Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0516482Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0516579Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0516581Z 2025-12-04T13:44:26.0516817Z [rank2]:[W1204 13:33:59.131621024 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0516988Z [rank1]:[W1204 13:33:59.715874525 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0517163Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0517420Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0517635Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0518004Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0518206Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0518310Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0518411Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0518507Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0518509Z 2025-12-04T13:44:26.0518743Z [rank1]:[W1204 13:33:59.717539229 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0518939Z [rank3]:[W1204 13:34:00.102139837 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0519125Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0519381Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0519544Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0519911Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0520129Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0520235Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0520330Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0520425Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0520427Z 2025-12-04T13:44:26.0520661Z [rank3]:[W1204 13:34:00.104234451 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0520831Z [rank2]:[W1204 13:34:00.131764426 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0521007Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0521262Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0521425Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0521793Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0521997Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0522104Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0522200Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0522296Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0522298Z 2025-12-04T13:44:26.0522530Z [rank2]:[W1204 13:34:00.133011339 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0522702Z [rank1]:[W1204 13:34:00.717688591 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0522896Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0523161Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0523324Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0523687Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0523903Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0524008Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0524103Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0524199Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0524201Z 2025-12-04T13:44:26.0524437Z [rank1]:[W1204 13:34:00.719214257 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0524609Z [rank3]:[W1204 13:34:01.104406163 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0524787Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0525045Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0525208Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0525575Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0525775Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0525884Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0525980Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0526075Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0526077Z 2025-12-04T13:44:26.0526310Z [rank3]:[W1204 13:34:01.105791953 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0526480Z [rank2]:[W1204 13:34:01.133145412 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0526657Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0526940Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0527118Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0527526Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0527740Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0527845Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0527942Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0528039Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0528040Z 2025-12-04T13:44:26.0528273Z [rank2]:[W1204 13:34:01.134609519 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0528447Z [rank1]:[W1204 13:34:01.719358230 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0528623Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0528883Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0529047Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0529415Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0529616Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0529721Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0529817Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0529913Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0529915Z 2025-12-04T13:44:26.0530147Z [rank1]:[W1204 13:34:01.720604923 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0530317Z [rank3]:[W1204 13:34:02.105975675 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0530493Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0530774Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0530950Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0531324Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0531528Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0531644Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0531741Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0531838Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0531840Z 2025-12-04T13:44:26.0532072Z [rank3]:[W1204 13:34:02.107972971 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0532241Z [rank2]:[W1204 13:34:02.134718433 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0532415Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0532670Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0532839Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0533212Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0533413Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0533519Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0533614Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0533714Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0533716Z 2025-12-04T13:44:26.0533950Z [rank2]:[W1204 13:34:02.136204950 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0534123Z [rank1]:[W1204 13:34:02.720742086 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0534298Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0534552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0534737Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0535114Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0535316Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0535420Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0535527Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0535623Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0535627Z 2025-12-04T13:44:26.0535861Z [rank1]:[W1204 13:34:02.721989138 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0536037Z [rank3]:[W1204 13:34:03.108161732 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0536214Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0536471Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0536637Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0537005Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0537212Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0537315Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0537412Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0537548Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0537550Z 2025-12-04T13:44:26.0537789Z [rank3]:[W1204 13:34:03.111118818 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0537958Z [rank2]:[W1204 13:34:03.136347773 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0538133Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0538387Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0538552Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0538943Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0539157Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0539263Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0539358Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0539469Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0539471Z 2025-12-04T13:44:26.0539709Z [rank2]:[W1204 13:34:03.137600206 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0539882Z [rank1]:[W1204 13:34:03.722070663 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0540058Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0540311Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0540477Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0540843Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0541046Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0541153Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0541249Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0541347Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0541350Z 2025-12-04T13:44:26.0541585Z [rank1]:[W1204 13:34:03.723293186 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0541756Z [rank3]:[W1204 13:34:04.111301200 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0541934Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0542199Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0542364Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0542764Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0542975Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0543079Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0543174Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0543270Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0543282Z 2025-12-04T13:44:26.0543519Z [rank3]:[W1204 13:34:04.113442703 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0543690Z [rank2]:[W1204 13:34:04.137737089 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0543865Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0544120Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0544286Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0544658Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0544859Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0544964Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0545059Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0545157Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0545159Z 2025-12-04T13:44:26.0545396Z [rank2]:[W1204 13:34:04.138961832 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0545567Z [rank1]:[W1204 13:34:04.723449819 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0545744Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0546000Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0546162Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0546546Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0546766Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0546870Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0546966Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0547062Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0547064Z 2025-12-04T13:44:26.0547298Z [rank1]:[W1204 13:34:04.725680020 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0547516Z [rank3]:[W1204 13:34:05.113626305 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0547694Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0547950Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0548111Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0548480Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0548686Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0548793Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0548889Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0548984Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0548986Z 2025-12-04T13:44:26.0549220Z [rank3]:[W1204 13:34:05.115727169 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0549392Z [rank2]:[W1204 13:34:05.139071796 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0549568Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0549826Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0549988Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0550354Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0550585Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0550702Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0550801Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0550904Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0550906Z 2025-12-04T13:44:26.0551142Z [rank2]:[W1204 13:34:05.140303419 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0551325Z [rank1]:[W1204 13:34:05.725859572 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0551502Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0551755Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0551921Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0552287Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0552490Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0552597Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0552692Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0552788Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0552790Z 2025-12-04T13:44:26.0553027Z [rank1]:[W1204 13:34:05.727980586 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0553197Z [rank3]:[W1204 13:34:06.115903722 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0553375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0553632Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0553795Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0554164Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0554367Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0554490Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0554597Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0554693Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0554695Z 2025-12-04T13:44:26.0554928Z [rank3]:[W1204 13:34:06.118046235 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0555100Z [rank2]:[W1204 13:34:06.140519381 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0555297Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0555554Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0555717Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0556087Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0556289Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0556396Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0556493Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0556590Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0556591Z 2025-12-04T13:44:26.0556825Z [rank2]:[W1204 13:34:06.142572336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0556994Z [rank1]:[W1204 13:34:06.728097980 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0557170Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0557432Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0557628Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0557994Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0558195Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0558302Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0558430Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0558540Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0558542Z 2025-12-04T13:44:26.0558774Z [rank1]:[W1204 13:34:06.729889290 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0558944Z [rank3]:[W1204 13:34:07.118226017 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0559118Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0559392Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0559556Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0559925Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0560127Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0560232Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0560329Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0560426Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0560430Z 2025-12-04T13:44:26.0560666Z [rank3]:[W1204 13:34:07.120512397 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0560837Z [rank2]:[W1204 13:34:07.142711899 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0561012Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0561270Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0561435Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0561812Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0562016Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0562123Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0562219Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0562336Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0562347Z 2025-12-04T13:44:26.0562581Z [rank2]:[W1204 13:34:07.144217446 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0562750Z [rank1]:[W1204 13:34:07.730032684 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0562927Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0563181Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0563358Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0563724Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0563931Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0564039Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0564135Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0564231Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0564233Z 2025-12-04T13:44:26.0564466Z [rank1]:[W1204 13:34:07.732488870 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0564641Z [rank3]:[W1204 13:34:08.120877196 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0564814Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0565071Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0565236Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0565605Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0565805Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0565908Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0566005Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0566102Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0566104Z 2025-12-04T13:44:26.0566363Z [rank3]:[W1204 13:34:08.123341362 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0566543Z [rank2]:[W1204 13:34:08.144354560 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0566718Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0566973Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0567145Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0567557Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0567758Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0567864Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0567960Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0568058Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0568059Z 2025-12-04T13:44:26.0568297Z [rank2]:[W1204 13:34:08.145584723 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0568470Z [rank1]:[W1204 13:34:08.732630214 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0568644Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0568899Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0569063Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0569430Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0569633Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0569738Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0569832Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0569929Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0569932Z 2025-12-04T13:44:26.0570189Z [rank1]:[W1204 13:34:08.735078801 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0570378Z [rank3]:[W1204 13:34:09.123481546 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0570554Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0570811Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0570974Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0571354Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0571556Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0571660Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0571760Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0571856Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0571859Z 2025-12-04T13:44:26.0572093Z [rank3]:[W1204 13:34:09.124700599 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0572263Z [rank2]:[W1204 13:34:09.145736757 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0572440Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0572700Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0572866Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0573235Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0573437Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0573545Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0573640Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0573736Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0573739Z 2025-12-04T13:44:26.0573973Z [rank2]:[W1204 13:34:09.146940081 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0574164Z [rank1]:[W1204 13:34:09.735229435 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0574350Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0574605Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0574772Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0575146Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0575359Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0575463Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0575558Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0575655Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0575657Z 2025-12-04T13:44:26.0575887Z [rank1]:[W1204 13:34:09.737735419 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0576060Z [rank3]:[W1204 13:34:10.124832344 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0576237Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0576494Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0576656Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0577025Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0577234Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0577338Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0577433Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0577579Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0577581Z 2025-12-04T13:44:26.0577815Z [rank3]:[W1204 13:34:10.126049457 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0577987Z [rank2]:[W1204 13:34:10.147154004 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0578188Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0578459Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0578622Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0578988Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0579205Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0579312Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0579410Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0579508Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0579509Z 2025-12-04T13:44:26.0579743Z [rank2]:[W1204 13:34:10.149311526 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0579914Z [rank1]:[W1204 13:34:10.737873424 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0580090Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0580346Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0580511Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0580876Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0581078Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0581184Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0581278Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0581375Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0581377Z 2025-12-04T13:44:26.0581614Z [rank1]:[W1204 13:34:10.740367759 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0581786Z [rank3]:[W1204 13:34:11.126222751 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0581961Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0582238Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0582411Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0582777Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0582990Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0583096Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0583193Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0583289Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0583291Z 2025-12-04T13:44:26.0583524Z [rank3]:[W1204 13:34:11.127481263 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0583697Z [rank2]:[W1204 13:34:11.149483210 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0583874Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0584130Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0584293Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0584660Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0584862Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0584969Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0585068Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0585165Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0585167Z 2025-12-04T13:44:26.0585399Z [rank2]:[W1204 13:34:11.151374149 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0585569Z [rank1]:[W1204 13:34:11.740535843 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0585749Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0586034Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0586208Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0586575Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0586775Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0586891Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0586988Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0587086Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0587087Z 2025-12-04T13:44:26.0587319Z [rank1]:[W1204 13:34:11.742320734 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0587530Z [rank3]:[W1204 13:34:12.127665807 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0587703Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0587964Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0588132Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0588497Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0588699Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0588805Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0588902Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0589002Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0589004Z 2025-12-04T13:44:26.0589240Z [rank3]:[W1204 13:34:12.129673813 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0589410Z [rank2]:[W1204 13:34:12.151538353 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0589584Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0589839Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0590030Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0590416Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0590618Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0590737Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0590833Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0590931Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0590935Z 2025-12-04T13:44:26.0591176Z [rank2]:[W1204 13:34:12.153642157 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0591223Z PASSED [217.0869s] [ 8%] 2025-12-04T13:44:26.0591480Z distributed/test_dynamo_distributed.py::TestMultiProc::test_get_pg_attr I1204 13:34:12.288000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 76342 2025-12-04T13:44:26.0591632Z I1204 13:34:12.288000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 76343 2025-12-04T13:44:26.0591783Z I1204 13:34:12.289000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 76344 2025-12-04T13:44:26.0591931Z I1204 13:34:12.289000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 76345 2025-12-04T13:44:26.0592104Z [rank1]:[W1204 13:34:12.742479039 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0592281Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0592539Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0592705Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0593075Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > 
> const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0593276Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0593381Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0593479Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0593576Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0593582Z 2025-12-04T13:44:26.0593815Z [rank1]:[W1204 13:34:12.743735061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0594010Z [rank3]:[W1204 13:34:13.129873597 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0594195Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0594451Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0594615Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0594997Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0595200Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0595304Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0595400Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0595496Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0595497Z 2025-12-04T13:44:26.0595731Z [rank3]:[W1204 13:34:13.132282634 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0595904Z [rank2]:[W1204 13:34:13.153823911 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0596080Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0596335Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0596499Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0596876Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0597081Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0597186Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0597281Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0597377Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0597379Z 2025-12-04T13:44:26.0597649Z [rank2]:[W1204 13:34:13.156218438 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0597836Z [rank1]:[W1204 13:34:13.743890446 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0598040Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0598297Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0603615Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0603998Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0604234Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0604342Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0604442Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0604537Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0604541Z 2025-12-04T13:44:26.0604775Z [rank1]:[W1204 13:34:13.745292875 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0604950Z [rank3]:[W1204 13:34:14.132483128 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0605129Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0605389Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0605552Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0605919Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0606128Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0606234Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0606331Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0606427Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0606429Z 2025-12-04T13:44:26.0606668Z [rank3]:[W1204 13:34:14.134787487 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0606839Z [rank2]:[W1204 13:34:14.156376503 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0607040Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0607318Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0607537Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0607903Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0608120Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0608227Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0608325Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0608424Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0608426Z 2025-12-04T13:44:26.0608658Z [rank2]:[W1204 13:34:14.158770340 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0608831Z [rank1]:[W1204 13:34:14.745476029 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0609009Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0609266Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0609432Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0609796Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0610001Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0610106Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0610205Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0610300Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0610303Z 2025-12-04T13:44:26.0610538Z [rank1]:[W1204 13:34:14.747014336 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0610709Z [rank3]:[W1204 13:34:15.134948392 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0610884Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0611169Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0611344Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0611717Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0611931Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0612035Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0612134Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0612229Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0612231Z 2025-12-04T13:44:26.0612465Z [rank3]:[W1204 13:34:15.136172975 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0612635Z [rank2]:[W1204 13:34:15.158920085 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0612817Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0613072Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0613238Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0613608Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0613812Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0613918Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0614014Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0614113Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0614115Z 2025-12-04T13:44:26.0614347Z [rank2]:[W1204 13:34:15.160854593 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0614516Z [rank1]:[W1204 13:34:15.747196560 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0614692Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0614958Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0615143Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0615516Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0615718Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0615833Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0615932Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0616031Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0616035Z 2025-12-04T13:44:26.0616267Z [rank1]:[W1204 13:34:15.748920462 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0616436Z [rank3]:[W1204 13:34:16.136323520 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0616610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0616866Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0617031Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0617396Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0617638Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0617743Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0617840Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0617936Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0617939Z 2025-12-04T13:44:26.0618176Z [rank3]:[W1204 13:34:16.138276578 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0618347Z [rank2]:[W1204 13:34:16.161024998 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0618523Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0618778Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0618970Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0619352Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0619553Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0619659Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0619776Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0619875Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0619878Z 2025-12-04T13:44:26.0620112Z [rank2]:[W1204 13:34:16.163203480 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0620283Z [rank1]:[W1204 13:34:16.749105787 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0620459Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0620712Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0620876Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0621242Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0621445Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0621550Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0621646Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0621744Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0621746Z 2025-12-04T13:44:26.0621983Z [rank1]:[W1204 13:34:16.750880008 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0622154Z [rank3]:[W1204 13:34:17.138484312 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0622328Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0622588Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0622752Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0623140Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0623351Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0623454Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0623550Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0623658Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0623661Z 2025-12-04T13:44:26.0623896Z [rank3]:[W1204 13:34:17.140452709 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0624068Z [rank2]:[W1204 13:34:17.163348295 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0624244Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0624498Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0624662Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0625035Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0625236Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0625341Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0625437Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0625533Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0625536Z 2025-12-04T13:44:26.0625770Z [rank2]:[W1204 13:34:17.164570409 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0625949Z [rank1]:[W1204 13:34:17.751057643 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0626126Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0626388Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0626555Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0626951Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0627162Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0627265Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0627361Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0627457Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0627470Z 2025-12-04T13:44:26.0627732Z [rank1]:[W1204 13:34:17.752858533 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0627903Z [rank3]:[W1204 13:34:18.140605374 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0628081Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0628339Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0628502Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0628873Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0629077Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0629181Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0629276Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0629372Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0629373Z 2025-12-04T13:44:26.0629607Z [rank3]:[W1204 13:34:18.142574071 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0629779Z [rank2]:[W1204 13:34:18.164674085 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0629955Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0630212Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0630374Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0630762Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0630990Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0631096Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0631193Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0631292Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0631294Z 2025-12-04T13:44:26.0631529Z [rank2]:[W1204 13:34:18.165840320 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0631713Z [rank1]:[W1204 13:34:18.753046808 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0631889Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0632142Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0632306Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0632674Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0632880Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0632989Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0633086Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0633182Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0633184Z 2025-12-04T13:44:26.0633418Z [rank1]:[W1204 13:34:18.754821229 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0633592Z [rank3]:[W1204 13:34:19.142705177 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0633770Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0634026Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0634188Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0634553Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0634777Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0634900Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0634999Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0635095Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0635096Z 2025-12-04T13:44:26.0635335Z [rank3]:[W1204 13:34:19.144818721 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0635522Z [rank2]:[W1204 13:34:19.165979866 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0635699Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0635957Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0636120Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0636487Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0636690Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0636797Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0636893Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0636989Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0636991Z 2025-12-04T13:44:26.0637229Z [rank2]:[W1204 13:34:19.167278327 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0637404Z [rank1]:[W1204 13:34:19.754998765 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0637638Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0637894Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0638057Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0638424Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0638630Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0638765Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0638874Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0638972Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0638974Z 2025-12-04T13:44:26.0639206Z [rank1]:[W1204 13:34:19.756790966 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0639378Z [rank3]:[W1204 13:34:20.145007046 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0639572Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0639833Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0639995Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0640361Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0640563Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0640667Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0640766Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0640862Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0640864Z 2025-12-04T13:44:26.0641095Z [rank3]:[W1204 13:34:20.146377706 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0641266Z [rank2]:[W1204 13:34:20.167424664 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0641446Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0641706Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0641870Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0642237Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0642439Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0642548Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0642658Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0642775Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0642777Z 2025-12-04T13:44:26.0643011Z [rank2]:[W1204 13:34:20.168785924 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0643179Z [rank1]:[W1204 13:34:20.756940512 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0643355Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0643624Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0643791Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0644163Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0644367Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0644474Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0644568Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0644667Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0644669Z 2025-12-04T13:44:26.0644904Z [rank1]:[W1204 13:34:20.758189444 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0645077Z [rank3]:[W1204 13:34:21.146559892 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0645252Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0645510Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0645675Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0646040Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0646244Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0646349Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0646446Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0646552Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0646574Z 2025-12-04T13:44:26.0646808Z [rank3]:[W1204 13:34:21.148168826 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0646981Z [rank2]:[W1204 13:34:21.168926610 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0647156Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0647414Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0647636Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0648003Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0648204Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0648309Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0648406Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0648506Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0648508Z 2025-12-04T13:44:26.0648748Z [rank2]:[W1204 13:34:21.170928836 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0648921Z [rank1]:[W1204 13:34:21.758331211 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0649097Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0649352Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0649516Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0649883Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0650087Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0650192Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0650288Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0650386Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0650388Z 2025-12-04T13:44:26.0650656Z [rank1]:[W1204 13:34:21.759573544 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0650840Z [rank3]:[W1204 13:34:22.148274404 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0651017Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0651276Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0651459Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0651825Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0652030Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0652132Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0652230Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0652329Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0652331Z 2025-12-04T13:44:26.0652567Z [rank3]:[W1204 13:34:22.149427449 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0652740Z [rank2]:[W1204 13:34:22.171081303 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0652914Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0653168Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0653333Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0653706Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0653908Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0654013Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0654109Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0654205Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0654207Z 2025-12-04T13:44:26.0654455Z [rank2]:[W1204 13:34:22.172909663 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0654652Z [rank1]:[W1204 13:34:22.759738070 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0654829Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0655087Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0655249Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0655627Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0655830Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0655937Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0656032Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0656128Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0656131Z 2025-12-04T13:44:26.0656362Z [rank1]:[W1204 13:34:22.761501291 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0656534Z [rank3]:[W1204 13:34:23.149593595 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0656709Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0656969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0657135Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0657547Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0657749Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0657853Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0657949Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0658045Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0658047Z 2025-12-04T13:44:26.0658283Z [rank3]:[W1204 13:34:23.151990892 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0658484Z [rank2]:[W1204 13:34:23.173099639 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0658671Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0658926Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0659088Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0659462Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0659680Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0659786Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0659882Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0659978Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0659980Z 2025-12-04T13:44:26.0660211Z [rank2]:[W1204 13:34:23.175409378 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0660383Z [rank1]:[W1204 13:34:23.761684457 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0660561Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0660817Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0660984Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0661350Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0661556Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0661662Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0661756Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0661852Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0661854Z 2025-12-04T13:44:26.0662088Z [rank1]:[W1204 13:34:23.764116514 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0662134Z PASSED [11.8206s] [ 11%] 2025-12-04T13:44:26.0662412Z distributed/test_dynamo_distributed.py::TestMultiProc::test_guard_collective I1204 13:34:24.110000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 79671 2025-12-04T13:44:26.0662584Z I1204 13:34:24.110000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 79672 2025-12-04T13:44:26.0662735Z I1204 13:34:24.111000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 79673 2025-12-04T13:44:26.0662883Z I1204 13:34:24.111000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 79674 2025-12-04T13:44:26.0663060Z [rank3]:[W1204 13:34:24.152098120 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0663238Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0663518Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:26.0663682Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0664055Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0664257Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0664365Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0664467Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0664564Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0664566Z 2025-12-04T13:44:26.0664802Z [rank3]:[W1204 13:34:24.153342443 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0664971Z [rank2]:[W1204 13:34:24.175575654 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0665147Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0665403Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0665571Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0665942Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0666145Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0666253Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0666348Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0666475Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0666489Z 2025-12-04T13:44:26.0666723Z [rank2]:[W1204 13:34:24.176815417 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0666897Z [rank1]:[W1204 13:34:24.764310020 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0667074Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0667346Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0667549Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0667917Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0668126Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 
(0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0668230Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0668327Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0668425Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0668428Z 2025-12-04T13:44:26.0668662Z [rank1]:[W1204 13:34:24.766621269 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0668833Z [rank3]:[W1204 13:34:25.153504250 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0669007Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0669268Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0669435Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0669804Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0670007Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0670110Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0670207Z frame #5: + 0x94ac3 
(0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0670303Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0670305Z 2025-12-04T13:44:26.0670569Z [rank3]:[W1204 13:34:25.155887367 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0670751Z [rank2]:[W1204 13:34:25.176935875 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0670929Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0671185Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0671365Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0671732Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0671936Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0672041Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0672137Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0672236Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0672238Z 2025-12-04T13:44:26.0672472Z [rank2]:[W1204 13:34:25.179342062 
ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0672643Z [rank1]:[W1204 13:34:25.766815796 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0672820Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0673074Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0673239Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0673603Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0673805Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0673912Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0674014Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0674114Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0674116Z 2025-12-04T13:44:26.0674372Z [rank1]:[W1204 13:34:25.769158604 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0674554Z [rank3]:[W1204 13:34:26.156090504 
TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0674728Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0674985Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0675148Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0675533Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0675736Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0675840Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0675937Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0676033Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0676036Z 2025-12-04T13:44:26.0676273Z [rank3]:[W1204 13:34:26.158306195 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0676447Z [rank2]:[W1204 13:34:26.179496949 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0676624Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0676878Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0677041Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0677413Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0677655Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0677762Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0677857Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0677955Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0677957Z 2025-12-04T13:44:26.0678189Z [rank2]:[W1204 13:34:26.181180112 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0678391Z [rank1]:[W1204 13:34:26.769334711 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0678578Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0678835Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0679002Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0679398Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0679602Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0679704Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0679802Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0679899Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0679901Z 2025-12-04T13:44:26.0680133Z [rank1]:[W1204 13:34:26.770899287 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0680307Z [rank3]:[W1204 13:34:27.158499681 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0680482Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0680739Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0680902Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0681273Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0681481Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0681584Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0681679Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0681778Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0681780Z 2025-12-04T13:44:26.0682013Z [rank3]:[W1204 13:34:27.160474748 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0682199Z [rank2]:[W1204 13:34:27.181337000 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0682395Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0682656Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0682824Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0683196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0683412Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0683518Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0683613Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0683711Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0683713Z 2025-12-04T13:44:26.0683947Z [rank2]:[W1204 13:34:27.183314246 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0684117Z [rank1]:[W1204 13:34:27.771064944 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0684292Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0684548Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0684713Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0685082Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0685293Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0685402Z frame #4: + 0xdc253 (0x7e1038e48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0685497Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0685593Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0685595Z 2025-12-04T13:44:26.0685826Z [rank1]:[W1204 13:34:27.772804736 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0685998Z [rank3]:[W1204 13:34:28.160648905 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0686197Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0686483Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0686647Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0687011Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0687222Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0687327Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0687428Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0687569Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.0687571Z 2025-12-04T13:44:26.0687806Z [rank3]:[W1204 13:34:28.161873008 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0687976Z [rank2]:[W1204 13:34:28.183483414 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0688152Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0688413Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0688575Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0688946Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0689148Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0689254Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0689352Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0689450Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0689452Z 2025-12-04T13:44:26.0689686Z [rank2]:[W1204 13:34:28.185412151 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.0689855Z [rank1]:[W1204 13:34:28.772991093 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0690035Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0690324Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0690500Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0690869Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0691071Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0691193Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0691292Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0691393Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0691395Z 2025-12-04T13:44:26.0691626Z [rank1]:[W1204 13:34:28.774746204 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0691796Z [rank3]:[W1204 13:34:29.162039196 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0691970Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0692228Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0692392Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0692761Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0692963Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0693068Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0693168Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0693264Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0693269Z 2025-12-04T13:44:26.0693501Z [rank3]:[W1204 13:34:29.163272409 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0693675Z [rank2]:[W1204 13:34:29.185552089 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0693854Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0694123Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0694312Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0694678Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0694881Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0695004Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0695099Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0695197Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0695200Z 2025-12-04T13:44:26.0695438Z [rank2]:[W1204 13:34:29.186846361 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0695609Z [rank1]:[W1204 13:34:29.774886992 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0695784Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0696042Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0696210Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.0696577Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0696784Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0696894Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0696989Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0697085Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0697088Z 2025-12-04T13:44:26.0697324Z [rank1]:[W1204 13:34:29.776655804 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0697537Z [rank3]:[W1204 13:34:30.163450636 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0697711Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0697969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0698147Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0698542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.0698743Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0698847Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0698965Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0699061Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0699063Z 2025-12-04T13:44:26.0699305Z [rank3]:[W1204 13:34:30.165650448 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0699478Z [rank2]:[W1204 13:34:30.187012009 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0699652Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0699908Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0700072Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0700438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0700639Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0700754Z frame #4: + 0xdc253 
(0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0700849Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0700947Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0700949Z 2025-12-04T13:44:26.0701185Z [rank2]:[W1204 13:34:30.189330368 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0701360Z [rank1]:[W1204 13:34:30.776790312 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0701538Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0701792Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0701956Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0702347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0702567Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0702680Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0702778Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0702891Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0702893Z 2025-12-04T13:44:26.0703126Z [rank1]:[W1204 13:34:30.778592863 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0703297Z [rank3]:[W1204 13:34:31.165736748 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0703471Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0703734Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0703895Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0704263Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0704467Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0704572Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0704680Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0705067Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0705070Z 2025-12-04T13:44:26.0705309Z [rank3]:[W1204 13:34:31.166895452 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0705482Z [rank2]:[W1204 13:34:31.189468817 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0705655Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0705910Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0706077Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0706470Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0706682Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0706787Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0706883Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0706979Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0706981Z 2025-12-04T13:44:26.0707225Z [rank2]:[W1204 13:34:31.190682580 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0707717Z [rank1]:[W1204 13:34:31.778773011 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.0707894Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0708148Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0708311Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0708680Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0708882Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0708986Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0709080Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0709176Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0709178Z 2025-12-04T13:44:26.0709411Z [rank1]:[W1204 13:34:31.780685609 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0709583Z [rank3]:[W1204 13:34:32.167018162 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0709759Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0710015Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0710179Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0710548Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0710787Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0710905Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0711001Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0711096Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0711098Z 2025-12-04T13:44:26.0711332Z [rank3]:[W1204 13:34:32.168258104 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0711518Z [rank2]:[W1204 13:34:32.190880288 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0711692Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0711950Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0712112Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0712486Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0712689Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0712794Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0712889Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0712985Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0712987Z 2025-12-04T13:44:26.0713219Z [rank2]:[W1204 13:34:32.192727377 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0713390Z [rank1]:[W1204 13:34:32.780844007 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0713566Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0713820Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0713982Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0714353Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0714566Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0714695Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0714789Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0714885Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0714887Z 2025-12-04T13:44:26.0715118Z [rank1]:[W1204 13:34:32.782118389 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0715288Z [rank3]:[W1204 13:34:33.168431073 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0715473Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0715734Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0715896Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0716262Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0716465Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0716570Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0716666Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0716760Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0716762Z 2025-12-04T13:44:26.0716997Z [rank3]:[W1204 13:34:33.169680435 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0717167Z [rank2]:[W1204 13:34:33.192928765 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0717343Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0717639Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0717801Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0718169Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0718370Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0718486Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0718610Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0718707Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0718709Z 2025-12-04T13:44:26.0718944Z [rank2]:[W1204 13:34:33.195100157 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0719115Z [rank1]:[W1204 13:34:33.782290108 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0719309Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0719566Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0719729Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0720095Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0720296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0720402Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0720498Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0720595Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0720597Z 2025-12-04T13:44:26.0720829Z [rank1]:[W1204 13:34:33.783610039 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0721001Z [rank3]:[W1204 13:34:34.169808075 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0721176Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0721434Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0721598Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0721962Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0722165Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0722268Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0722378Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0722496Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0722499Z 2025-12-04T13:44:26.0722737Z [rank3]:[W1204 13:34:34.171343301 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0722907Z [rank2]:[W1204 13:34:34.195265015 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0723081Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0723356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0723520Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0723887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0724089Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0724195Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0724291Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0724388Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0724391Z 2025-12-04T13:44:26.0724626Z [rank2]:[W1204 13:34:34.197560065 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0724796Z [rank1]:[W1204 13:34:34.783782027 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0724971Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0725225Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0725390Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0725758Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0725958Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0726063Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0726159Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0726276Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0726287Z 2025-12-04T13:44:26.0726531Z [rank1]:[W1204 13:34:34.785656986 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0726702Z [rank3]:[W1204 13:34:35.171495340 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0726877Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0727132Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0727308Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0727723Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0727924Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0728028Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0728125Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0728220Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0728223Z 2025-12-04T13:44:26.0728460Z [rank3]:[W1204 13:34:35.172790101 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0728631Z [rank2]:[W1204 13:34:35.197733183 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0728804Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0729062Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0729224Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0729596Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0729798Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0729902Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0729998Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0730097Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0730098Z 2025-12-04T13:44:26.0730357Z [rank2]:[W1204 13:34:35.200122581 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0730538Z [rank1]:[W1204 13:34:35.785816455 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0730714Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0730970Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0731150Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0731521Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0731723Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0731827Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0731921Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0732017Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0732020Z 2025-12-04T13:44:26.0732253Z [rank1]:[W1204 13:34:35.787831051 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0732424Z [rank3]:[W1204 13:34:36.172947401 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0732599Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0732857Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0733021Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0733392Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0733594Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0733698Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0733797Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0733892Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0733896Z 2025-12-04T13:44:26.0734128Z [rank3]:[W1204 13:34:36.174203063 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0734317Z [rank2]:[W1204 13:34:36.200263101 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0734502Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0734757Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0734918Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0735298Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0735501Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0735604Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0735702Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0735798Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0735799Z 2025-12-04T13:44:26.0736035Z [rank2]:[W1204 13:34:36.201970363 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0736205Z [rank1]:[W1204 13:34:36.787958261 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0736383Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0736637Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0736800Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0737166Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0737368Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0737511Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0737606Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0737702Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0737704Z 2025-12-04T13:44:26.0737937Z [rank1]:[W1204 13:34:36.790256300 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0738121Z [rank3]:[W1204 13:34:37.174362202 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0738320Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0738576Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0738739Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0739103Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0739327Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0739432Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0739528Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0739624Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0739626Z 2025-12-04T13:44:26.0739860Z [rank3]:[W1204 13:34:37.175612005 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0740032Z [rank2]:[W1204 13:34:37.202076503 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0740207Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0740462Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0740623Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0740988Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0741193Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0741298Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0741394Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0741489Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0741491Z 2025-12-04T13:44:26.0741725Z [rank2]:[W1204 13:34:37.204269585 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0741895Z [rank1]:[W1204 13:34:37.790411879 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0742090Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0742356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0742518Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0742887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0743099Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0743206Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0743301Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0743397Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0743399Z 2025-12-04T13:44:26.0743630Z [rank1]:[W1204 13:34:37.792246489 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0743800Z [rank3]:[W1204 13:34:38.175741265 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0743978Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0744233Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0744397Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0744764Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0744967Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0745072Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0745169Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0745267Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0745269Z 2025-12-04T13:44:26.0745504Z [rank3]:[W1204 13:34:38.176985318 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0745674Z [rank2]:[W1204 13:34:38.204377456 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0745849Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0746123Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0746295Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0746669Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0746883Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0746988Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0747085Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0747182Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0747183Z 2025-12-04T13:44:26.0747416Z [rank2]:[W1204 13:34:38.206421971 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0747623Z [rank1]:[W1204 13:34:38.792426538 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0747800Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0748058Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0748222Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0748592Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0748793Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0748899Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0748996Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0749093Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0749094Z 2025-12-04T13:44:26.0749328Z [rank1]:[W1204 13:34:38.794623430 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0749499Z [rank3]:[W1204 13:34:39.177158717 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0749675Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0749942Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0750139Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0750508Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0750709Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0750827Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0750925Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0751023Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0751025Z 2025-12-04T13:44:26.0751256Z [rank3]:[W1204 13:34:39.178442949 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0751426Z [rank2]:[W1204 13:34:39.206569551 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0751599Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0751857Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0752022Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0752390Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0752592Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0752697Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0752792Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0752888Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0752891Z 2025-12-04T13:44:26.0753126Z [rank2]:[W1204 13:34:39.207793934 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0753297Z [rank1]:[W1204 13:34:39.794806059 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0753473Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0753734Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0753916Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0754298Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0754496Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0754601Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0754705Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0754801Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0754804Z 2025-12-04T13:44:26.0755038Z [rank1]:[W1204 13:34:39.796863064 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0755208Z [rank3]:[W1204 13:34:40.178601249 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0755383Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0755640Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0755808Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0756174Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0756376Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0756481Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0756576Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0756674Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0756676Z 2025-12-04T13:44:26.0756909Z [rank3]:[W1204 13:34:40.180877129 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0757080Z [rank2]:[W1204 13:34:40.207940754 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0757254Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0757543Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0757706Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0758098Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0758312Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0758415Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0758511Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0758620Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0758622Z 2025-12-04T13:44:26.0758859Z [rank2]:[W1204 13:34:40.209698085 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0759030Z [rank1]:[W1204 13:34:40.797072283 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0759206Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0759462Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0759625Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0759994Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0760196Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0760301Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0760396Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0760492Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0760495Z 2025-12-04T13:44:26.0760729Z [rank1]:[W1204 13:34:40.799400622 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0760901Z [rank3]:[W1204 13:34:41.181059578 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0761076Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0761333Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0761497Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0761885Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0762100Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0762207Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0762303Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0762402Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0762415Z 2025-12-04T13:44:26.0762650Z [rank3]:[W1204 13:34:41.182698192 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0762821Z [rank2]:[W1204 13:34:41.209841226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0762996Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0763251Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0763412Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0763783Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0763988Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0764092Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0764188Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0764284Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0764285Z 2025-12-04T13:44:26.0764519Z [rank2]:[W1204 13:34:41.211253465 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0764692Z [rank1]:[W1204 13:34:41.799546642 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0764867Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0765122Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0765283Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0765663Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0765885Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0765990Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0766084Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0766180Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0766182Z 2025-12-04T13:44:26.0766414Z [rank1]:[W1204 13:34:41.800763845 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0766597Z [rank3]:[W1204 13:34:42.182862372 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0766774Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0767029Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0767191Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0767595Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0767800Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0767905Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0767999Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0768095Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0768097Z 2025-12-04T13:44:26.0768329Z [rank3]:[W1204 13:34:42.184908097 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0768505Z [rank2]:[W1204 13:34:42.211350486 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0768680Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0768937Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0769099Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0769476Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0769713Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0769829Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0769925Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0770021Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0770024Z 2025-12-04T13:44:26.0770258Z [rank2]:[W1204 13:34:42.212670457 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0770440Z [rank1]:[W1204 13:34:42.800871917 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0770616Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0770874Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0771035Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0771402Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0771606Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0771714Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0771808Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0771904Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0771906Z 2025-12-04T13:44:26.0772138Z [rank1]:[W1204 13:34:42.802026611 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0772307Z [rank3]:[W1204 13:34:43.185075197 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0772484Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0772742Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0772907Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0773273Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0773476Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0773601Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0773708Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0773804Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0773806Z 2025-12-04T13:44:26.0774038Z [rank3]:[W1204 13:34:43.187423796 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0774210Z [rank2]:[W1204 13:34:43.212775449 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0774400Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0774657Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0774821Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0775192Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0775395Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0775499Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0775597Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0775695Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0775696Z 2025-12-04T13:44:26.0775932Z [rank2]:[W1204 13:34:43.214013282 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0776101Z [rank1]:[W1204 13:34:43.802143613 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0776278Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0776535Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0776696Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0777062Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0777262Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0777369Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0777521Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0777639Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0777641Z 2025-12-04T13:44:26.0777875Z [rank1]:[W1204 13:34:43.803926774 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0778045Z [rank3]:[W1204 13:34:44.187544877 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0778219Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0778489Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0778653Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0779020Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0779221Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0779328Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0779423Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0779521Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0779524Z 2025-12-04T13:44:26.0779762Z [rank3]:[W1204 13:34:44.189048884 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0779935Z [rank2]:[W1204 13:34:44.214146983 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0780107Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0780363Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0780528Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0780897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0781100Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0781204Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0781302Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0781410Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0781436Z 2025-12-04T13:44:26.0781672Z [rank2]:[W1204 13:34:44.215389696 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0781840Z [rank1]:[W1204 13:34:44.804056575 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0782015Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0782269Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0782442Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0782811Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0783015Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0783121Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0783217Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0783315Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0783317Z 2025-12-04T13:44:26.0783553Z [rank1]:[W1204 13:34:44.806284486 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0783723Z [rank3]:[W1204 13:34:45.189162626 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0783899Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0784155Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0784319Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0784685Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0784887Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0784993Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0785088Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0785189Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0785191Z 2025-12-04T13:44:26.0785450Z [rank3]:[W1204 13:34:45.190649833 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0785630Z [rank2]:[W1204 13:34:45.215524677 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0785803Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0786058Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0786231Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0786597Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0786802Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0786907Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0787005Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0787102Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0787104Z 2025-12-04T13:44:26.0787339Z [rank2]:[W1204 13:34:45.216748340 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0787542Z [rank1]:[W1204 13:34:45.806440097 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0787715Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0787973Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0788135Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0788506Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0788710Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0788816Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0788913Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0789010Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0789013Z 2025-12-04T13:44:26.0789262Z [rank1]:[W1204 13:34:45.807691750 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0789458Z [rank3]:[W1204 13:34:46.190836734 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0789634Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0789888Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0790050Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0790435Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0790636Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0790740Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0790836Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0790935Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0790938Z 2025-12-04T13:44:26.0791172Z [rank3]:[W1204 13:34:46.192780821 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0791345Z [rank2]:[W1204 13:34:46.216896481 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0791519Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0791774Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0791937Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0792307Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0792510Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0792614Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0792710Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0792806Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0792808Z 2025-12-04T13:44:26.0793042Z [rank2]:[W1204 13:34:46.218129854 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0793234Z [rank1]:[W1204 13:34:46.807840991 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0793417Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0793675Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0793836Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0794203Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0794415Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0794522Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0794619Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0794714Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0794716Z 2025-12-04T13:44:26.0794950Z [rank1]:[W1204 13:34:46.809111223 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0795120Z [rank3]:[W1204 13:34:47.192942382 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0795297Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0795555Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0795718Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0796086Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0796289Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0796394Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0796489Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0796586Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0796588Z 2025-12-04T13:44:26.0796820Z [rank3]:[W1204 13:34:47.194464329 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0796991Z [rank2]:[W1204 13:34:47.218294145 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0797186Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0797451Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0797650Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0798018Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0798242Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0798347Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0798444Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0798543Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0798544Z 2025-12-04T13:44:26.0798778Z [rank2]:[W1204 13:34:47.219795862 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0798950Z [rank1]:[W1204 13:34:47.809242055 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0799125Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0799381Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0799541Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0799906Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0800109Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0800214Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0800310Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0800406Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0800408Z 2025-12-04T13:44:26.0800643Z [rank1]:[W1204 13:34:47.810486378 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0800815Z [rank3]:[W1204 13:34:48.194591161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0800992Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0801272Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0801448Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0801814Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0802025Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0802132Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0802227Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0802325Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0802327Z 2025-12-04T13:44:26.0802561Z [rank3]:[W1204 13:34:48.196996818 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0802733Z [rank2]:[W1204 13:34:48.219944894 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0802908Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0803165Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0803329Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0803696Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0803898Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0804003Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0804100Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0804197Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0804199Z 2025-12-04T13:44:26.0804432Z [rank2]:[W1204 13:34:48.221950220 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0804604Z [rank1]:[W1204 13:34:48.810631179 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0804778Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0805054Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0805227Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0805593Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0805794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0805911Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0806011Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0806108Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0806110Z 2025-12-04T13:44:26.0806343Z [rank1]:[W1204 13:34:48.811888282 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0806513Z [rank3]:[W1204 13:34:49.197258877 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0806689Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0806946Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0807110Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0807509Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0807711Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0807817Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0807912Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0808010Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0808013Z 2025-12-04T13:44:26.0808244Z [rank3]:[W1204 13:34:49.199436829 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0808416Z [rank2]:[W1204 13:34:49.222048073 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0808590Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0808848Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0809042Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0809423Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0809625Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0809728Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0809837Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0809935Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0809940Z 2025-12-04T13:44:26.0810173Z [rank2]:[W1204 13:34:49.223190138 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0810344Z [rank1]:[W1204 13:34:49.812040294 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0810516Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0810772Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0810936Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0811307Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0811508Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0811611Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0811708Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0811803Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0811806Z 2025-12-04T13:44:26.0812042Z [rank1]:[W1204 13:34:49.813425803 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0812212Z [rank3]:[W1204 13:34:50.199638190 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0812387Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0812642Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0812809Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0813197Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0813415Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0813520Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0813614Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0813720Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0813722Z 2025-12-04T13:44:26.0813956Z [rank3]:[W1204 13:34:50.201515439 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0814127Z [rank2]:[W1204 13:34:50.223309390 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0814302Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0814559Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0814726Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0815094Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0815296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0815400Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0815497Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0815593Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0815598Z 2025-12-04T13:44:26.0815831Z [rank2]:[W1204 13:34:50.225563991 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0816003Z [rank1]:[W1204 13:34:50.813611694 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0816179Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0816437Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0816600Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0816986Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0817199Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0817303Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0817397Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0817530Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0817548Z 2025-12-04T13:44:26.0817783Z [rank1]:[W1204 13:34:50.815458494 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0817956Z [rank3]:[W1204 13:34:51.201681001 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0818132Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0818386Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0818550Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0818921Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0819124Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0819229Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0819324Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0819421Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0819423Z 2025-12-04T13:44:26.0819656Z [rank3]:[W1204 13:34:51.203311815 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0819829Z [rank2]:[W1204 13:34:51.225955087 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0820006Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0820262Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0820427Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0820820Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0821036Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0821139Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0821238Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0821335Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0821338Z 2025-12-04T13:44:26.0821571Z [rank2]:[W1204 13:34:51.227842766 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0821754Z [rank1]:[W1204 13:34:51.815627596 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0821928Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0822182Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0822343Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0822710Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0822913Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0823017Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0823113Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0823208Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0823210Z 2025-12-04T13:44:26.0823453Z [rank1]:[W1204 13:34:51.817443136 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0823624Z [rank3]:[W1204 13:34:52.203408208 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0823802Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0824058Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0824222Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0824591Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0824811Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0824926Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0825021Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0825119Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0825121Z 2025-12-04T13:44:26.0825353Z [rank3]:[W1204 13:34:52.205818895 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0825534Z [rank2]:[W1204 13:34:52.227951679 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0825718Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0825972Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0826135Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0826502Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0826708Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0826812Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0826910Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0827007Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0827009Z 2025-12-04T13:44:26.0827241Z [rank2]:[W1204 13:34:52.230007674 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0827412Z [rank1]:[W1204 13:34:52.817607418 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0827631Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0827888Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0828049Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0828418Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0828637Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0828772Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0828868Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0828964Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0828966Z 2025-12-04T13:44:26.0829200Z [rank1]:[W1204 13:34:52.819446667 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0829370Z [rank3]:[W1204 13:34:53.205980937 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0829560Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0829815Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0829978Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0830347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0830550Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0830656Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0830753Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0830851Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0830854Z 2025-12-04T13:44:26.0831087Z [rank3]:[W1204 13:34:53.208214698 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0831258Z [rank2]:[W1204 13:34:53.230082388 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0831435Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0831692Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0831856Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0832225Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0832426Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0832531Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0832651Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0832758Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0832760Z 2025-12-04T13:44:26.0832991Z [rank2]:[W1204 13:34:53.232278430 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0833162Z [rank1]:[W1204 13:34:53.819624169 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0833336Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0833604Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0833768Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0834136Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0834338Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0834443Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0834540Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0834638Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0834640Z 2025-12-04T13:44:26.0834875Z [rank1]:[W1204 13:34:53.821459719 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0835046Z [rank3]:[W1204 13:34:54.208382681 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0835221Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0835480Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0835646Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0836013Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0836215Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0836321Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0836417Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0836534Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0836546Z 2025-12-04T13:44:26.0836777Z [rank3]:[W1204 13:34:54.210607102 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0836948Z [rank2]:[W1204 13:34:54.232377914 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0837123Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0837388Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0837594Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0837960Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0838162Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0838266Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0838363Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0838461Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0838464Z 2025-12-04T13:44:26.0838701Z [rank2]:[W1204 13:34:54.234441168 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0838872Z [rank1]:[W1204 13:34:54.821613332 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0839047Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0839303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0839467Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0839839Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0840040Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0840144Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0840241Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0840336Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0840338Z 2025-12-04T13:44:26.0840603Z [rank1]:[W1204 13:34:54.823293335 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0840787Z [rank3]:[W1204 13:34:55.210872192 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0840963Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0841219Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0841396Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0841763Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0841965Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0842069Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0842165Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0842266Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0842268Z 2025-12-04T13:44:26.0842504Z [rank3]:[W1204 13:34:55.213192101 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0842675Z [rank2]:[W1204 13:34:55.234548252 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0842850Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0843104Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0843268Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0843635Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0843838Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0843951Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0844047Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0844149Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0844151Z 2025-12-04T13:44:26.0844413Z [rank2]:[W1204 13:34:55.236351493 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0844593Z [rank1]:[W1204 13:34:55.823480777 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0844768Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0845024Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0845186Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0845564Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0845767Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0845872Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0845971Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0846066Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0846069Z 2025-12-04T13:44:26.0846304Z [rank1]:[W1204 13:34:55.825330646 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0851098Z [rank3]:[W1204 13:34:56.213356434 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0851288Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0851544Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0851708Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0852094Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0852296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0852401Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0852496Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0852593Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0852596Z 2025-12-04T13:44:26.0852831Z [rank3]:[W1204 13:34:56.215425358 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0853047Z [rank2]:[W1204 13:34:56.236463186 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0853240Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0853494Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0853659Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0854050Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0854254Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0854358Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0854453Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0854550Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0854552Z 2025-12-04T13:44:26.0854786Z [rank2]:[W1204 13:34:56.238572920 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0854962Z [rank1]:[W1204 13:34:56.825494639 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0855140Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0855395Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0855557Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0855927Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0856132Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0856236Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0856332Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0856427Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0856429Z 2025-12-04T13:44:26.0856662Z [rank1]:[W1204 13:34:56.827021026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0856842Z [rank3]:[W1204 13:34:57.215615881 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0857037Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0857296Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0857460Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0857870Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0858090Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0858195Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0858289Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0858385Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0858387Z 2025-12-04T13:44:26.0858619Z [rank3]:[W1204 13:34:57.217615347 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0858790Z [rank2]:[W1204 13:34:57.238673144 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0858968Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0859223Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0859385Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0859750Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0859953Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0860058Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0860153Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0860250Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0860252Z 2025-12-04T13:44:26.0860485Z [rank2]:[W1204 13:34:57.240596222 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0860657Z [rank1]:[W1204 13:34:57.827192788 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0860845Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0861130Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0861292Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0861657Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0861878Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0861983Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0862081Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0862176Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0862178Z 2025-12-04T13:44:26.0862411Z [rank1]:[W1204 13:34:57.829583026 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0862581Z [rank3]:[W1204 13:34:58.217783040 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0862757Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0863014Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0863177Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0863543Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0863744Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0863849Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0863946Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0864042Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0864044Z 2025-12-04T13:44:26.0864277Z [rank3]:[W1204 13:34:58.219367875 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0864446Z [rank2]:[W1204 13:34:58.240701247 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0864623Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0864894Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0865069Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0865435Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0865640Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0865755Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0865852Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0865950Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0865953Z 2025-12-04T13:44:26.0866186Z [rank2]:[W1204 13:34:58.241878651 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0866355Z [rank1]:[W1204 13:34:58.829719620 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0866529Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0866786Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0866948Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0867315Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0867567Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0867671Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0867768Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0867865Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0867867Z 2025-12-04T13:44:26.0868100Z [rank1]:[W1204 13:34:58.831958191 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0868269Z [rank3]:[W1204 13:34:59.219560528 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0868443Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0868715Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0868903Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0869274Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0869476Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0869592Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0869688Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0869785Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0869788Z 2025-12-04T13:44:26.0870020Z [rank3]:[W1204 13:34:59.221947935 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0870189Z [rank2]:[W1204 13:34:59.241989525 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0870364Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0870617Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0870782Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0871152Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0871354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0871458Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0871555Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0871655Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0871657Z 2025-12-04T13:44:26.0871891Z [rank2]:[W1204 13:34:59.244262515 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0872062Z [rank1]:[W1204 13:34:59.832144813 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0872235Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0872490Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0872663Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0873055Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0873257Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0873360Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0873467Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0873562Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0873564Z 2025-12-04T13:44:26.0873797Z [rank1]:[W1204 13:34:59.834312396 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0873967Z [rank3]:[W1204 13:35:00.222138978 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0874142Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0874400Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0874564Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0874930Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0875131Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0875237Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0875333Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0875431Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0875433Z 2025-12-04T13:44:26.0875667Z [rank3]:[W1204 13:35:00.224515056 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0875838Z [rank2]:[W1204 13:35:00.244417079 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0876016Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0876272Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0876437Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0876825Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0877041Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0877146Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0877242Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0877350Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0877352Z 2025-12-04T13:44:26.0877622Z [rank2]:[W1204 13:35:00.246776177 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0877796Z [rank1]:[W1204 13:35:00.834437570 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0877971Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0878225Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0878387Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0878756Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0878957Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0879060Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0879156Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0879252Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0879255Z 2025-12-04T13:44:26.0879490Z [rank1]:[W1204 13:35:00.836833698 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0879662Z [rank3]:[W1204 13:35:01.224709949 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0879838Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0880093Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0880253Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0880648Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0880862Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0880967Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0881062Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0881159Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0881160Z 2025-12-04T13:44:26.0881414Z [rank3]:[W1204 13:35:01.226876841 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0881585Z [rank2]:[W1204 13:35:01.246911191 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0881761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0882015Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0882177Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0882545Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0882748Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0882853Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0882948Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0883045Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0883047Z 2025-12-04T13:44:26.0883279Z [rank2]:[W1204 13:35:01.249015755 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0883325Z PASSED [37.2610s] [ 13%] 2025-12-04T13:44:26.0883589Z distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager I1204 13:35:01.372000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 82945 2025-12-04T13:44:26.0883743Z I1204 13:35:01.373000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 82946 2025-12-04T13:44:26.0883893Z I1204 13:35:01.373000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 82947 
2025-12-04T13:44:26.0884041Z I1204 13:35:01.373000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 82948 2025-12-04T13:44:26.0884212Z [rank1]:[W1204 13:35:01.837030181 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0884387Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0884664Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0884836Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0885203Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0885415Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0885522Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0885619Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0885717Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0885719Z 2025-12-04T13:44:26.0885951Z [rank1]:[W1204 13:35:01.838538928 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0886120Z [rank3]:[W1204 13:35:02.227052675 TCPStore.cpp:106] [c10d] sendBytes 
failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0886297Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0886552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0886719Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0887085Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0887286Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0887392Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0887529Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0887626Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0887628Z 2025-12-04T13:44:26.0887859Z [rank3]:[W1204 13:35:02.228307847 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0888029Z [rank2]:[W1204 13:35:02.249146590 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0888203Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0888487Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0888664Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0889030Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0889231Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0889349Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0889447Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0889546Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0889548Z 2025-12-04T13:44:26.0889780Z [rank2]:[W1204 13:35:02.250505850 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0889951Z [rank1]:[W1204 13:35:02.838686042 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0890125Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0890381Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0890544Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0890910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0891110Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0891216Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0891311Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0891408Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0891411Z 2025-12-04T13:44:26.0891645Z [rank1]:[W1204 13:35:02.839922995 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0891816Z [rank3]:[W1204 13:35:03.228467841 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0891992Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0892246Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0892438Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0892815Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0893016Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0893120Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0893226Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0893323Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0893327Z 2025-12-04T13:44:26.0893560Z [rank3]:[W1204 13:35:03.229691245 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0893731Z [rank2]:[W1204 13:35:03.250647624 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0893904Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0894159Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0894323Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0894689Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0894892Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0894996Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0895092Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0895188Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0895190Z 2025-12-04T13:44:26.0895424Z [rank2]:[W1204 13:35:03.252327437 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0895597Z [rank1]:[W1204 13:35:03.840081559 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0895770Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0896025Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0896187Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0896584Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0896794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0896899Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0896994Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0897103Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0897105Z 2025-12-04T13:44:26.0897340Z [rank1]:[W1204 13:35:03.841393460 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0897543Z [rank3]:[W1204 13:35:04.229870028 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0897718Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0897973Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0898137Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0898508Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0898709Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0898812Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0898906Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0899004Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0899006Z 2025-12-04T13:44:26.0899241Z [rank3]:[W1204 13:35:04.231743597 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0899412Z [rank2]:[W1204 13:35:04.252461992 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0899586Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0899842Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0900005Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0900406Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0900620Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0900726Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0900822Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0900917Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0900935Z 2025-12-04T13:44:26.0901172Z [rank2]:[W1204 13:35:04.254499377 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0901343Z [rank1]:[W1204 13:35:04.841563104 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0901517Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0901770Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0901933Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0902304Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0902506Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0902611Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0902707Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0902802Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0902804Z 2025-12-04T13:44:26.0903039Z [rank1]:[W1204 13:35:04.843283887 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0903212Z [rank3]:[W1204 13:35:05.231932551 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0903389Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0903646Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0903810Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0904201Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0904412Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0904516Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0904610Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0904706Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0904707Z 2025-12-04T13:44:26.0904938Z [rank3]:[W1204 13:35:05.234119673 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0905125Z [rank2]:[W1204 13:35:05.254630402 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0905301Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0905555Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0905717Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0906081Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0906285Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0906388Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0906486Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0906582Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0906584Z 2025-12-04T13:44:26.0906819Z [rank2]:[W1204 13:35:05.256670718 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0906993Z [rank1]:[W1204 13:35:05.843407022 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0907167Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0907422Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0907625Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0907990Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0908224Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0908340Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0908435Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0908531Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0908532Z 2025-12-04T13:44:26.0908768Z [rank1]:[W1204 13:35:05.845176433 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0908950Z [rank3]:[W1204 13:35:06.234280048 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0909126Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0909381Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0909544Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0909907Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0910110Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0910216Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0910311Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0910408Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0910410Z 2025-12-04T13:44:26.0910642Z [rank3]:[W1204 13:35:06.236300203 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0910813Z [rank2]:[W1204 13:35:06.256803753 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0910989Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0911245Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0911408Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0911770Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0911973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0912098Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0912203Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0912299Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0912302Z 2025-12-04T13:44:26.0912534Z [rank2]:[W1204 13:35:06.259010265 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0912704Z [rank1]:[W1204 13:35:06.845369187 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0912890Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0913146Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0913308Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0913674Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0913876Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0913981Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0914079Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0914174Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0914176Z 2025-12-04T13:44:26.0914409Z [rank1]:[W1204 13:35:06.847246316 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0914577Z [rank3]:[W1204 13:35:07.236451358 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0914753Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0915008Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0915170Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0915535Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0915737Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0915844Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0915957Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0916063Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0916065Z 2025-12-04T13:44:26.0916297Z [rank3]:[W1204 13:35:07.237693571 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0916468Z [rank2]:[W1204 13:35:07.259139100 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0916641Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0916909Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0917073Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0917440Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0917674Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0917779Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0917877Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0917974Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0917976Z 2025-12-04T13:44:26.0918212Z [rank2]:[W1204 13:35:07.261232584 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0918384Z [rank1]:[W1204 13:35:07.847424050 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0918558Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.0918814Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0918978Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0919346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0919547Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0919651Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0919746Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0919872Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0919886Z 2025-12-04T13:44:26.0920120Z [rank1]:[W1204 13:35:07.849515054 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0920290Z [rank3]:[W1204 13:35:08.237793187 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0920465Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0920720Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0920904Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0921272Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0921474Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0921579Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0921675Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0921771Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0921774Z 2025-12-04T13:44:26.0922007Z [rank3]:[W1204 13:35:08.238932352 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0922177Z [rank2]:[W1204 13:35:08.261299561 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0922351Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0922606Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0922772Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0923140Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0923345Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0923447Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0923545Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0923641Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0923644Z 2025-12-04T13:44:26.0923897Z [rank2]:[W1204 13:35:08.262852027 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0924080Z [rank1]:[W1204 13:35:08.849678409 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0924254Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0924511Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0924683Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0925052Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0925253Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0925356Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0925454Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0925553Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0925555Z 2025-12-04T13:44:26.0925789Z [rank1]:[W1204 13:35:08.851012330 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0925958Z [rank3]:[W1204 13:35:09.239118347 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0926137Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0926392Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0926567Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0926935Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0927136Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0927240Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0927334Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0927430Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0927432Z 2025-12-04T13:44:26.0927715Z [rank3]:[W1204 13:35:09.240549535 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0927906Z [rank2]:[W1204 13:35:09.262960073 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0928083Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0928337Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0928501Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0928882Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0929086Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0929190Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0929285Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0929382Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0929385Z 2025-12-04T13:44:26.0929617Z [rank2]:[W1204 13:35:09.264133058 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0929793Z [rank1]:[W1204 13:35:09.851176225 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0929968Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0930221Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0930383Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0930751Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0930952Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0931059Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0931154Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0931248Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0931250Z 2025-12-04T13:44:26.0931486Z [rank1]:[W1204 13:35:09.853315578 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0931572Z SKIPPED [8.8154s] (Test skipped due to missing import) [ 16%] 
2025-12-04T13:44:26.0931875Z distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph I1204 13:35:10.189000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 85569 2025-12-04T13:44:26.0932027Z I1204 13:35:10.190000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 85570 2025-12-04T13:44:26.0932176Z I1204 13:35:10.190000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 85571 2025-12-04T13:44:26.0932324Z I1204 13:35:10.191000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 85572 2025-12-04T13:44:26.0932504Z [rank3]:[W1204 13:35:10.240725300 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0932683Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0932939Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0933104Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0933472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0933675Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0933784Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 
2025-12-04T13:44:26.0933879Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0933975Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0933977Z 2025-12-04T13:44:26.0934210Z [rank3]:[W1204 13:35:10.241979513 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0934381Z [rank2]:[W1204 13:35:10.264230034 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0934557Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0934814Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0934978Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0935348Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0935551Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0935676Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0935790Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0935885Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0935887Z 
2025-12-04T13:44:26.0936120Z [rank2]:[W1204 13:35:10.265367299 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0936291Z [rank1]:[W1204 13:35:10.853489164 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0936475Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0936737Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0936899Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0937264Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0937464Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0937602Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0937699Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0937795Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0937797Z 2025-12-04T13:44:26.0938030Z [rank1]:[W1204 13:35:10.855652606 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 
2025-12-04T13:44:26.0938199Z [rank3]:[W1204 13:35:11.242155918 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0938374Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0938632Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0938796Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0939163Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0939365Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0939472Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0939595Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0939703Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0939705Z 2025-12-04T13:44:26.0939937Z [rank3]:[W1204 13:35:11.244155704 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0940108Z [rank2]:[W1204 13:35:11.265464256 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0940283Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0940553Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0940717Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0941082Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0941284Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0941388Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0941485Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0941583Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0941586Z 2025-12-04T13:44:26.0941819Z [rank2]:[W1204 13:35:11.267036202 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0941990Z [rank1]:[W1204 13:35:11.855819711 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0942164Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0942421Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0942584Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0942950Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0943151Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0943257Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0943352Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0943465Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0943486Z 2025-12-04T13:44:26.0943724Z [rank1]:[W1204 13:35:11.858137811 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0943896Z [rank3]:[W1204 13:35:12.244328680 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0944070Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0944325Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0944500Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0944870Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0945072Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0945176Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0945272Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0945368Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0945371Z 2025-12-04T13:44:26.0945610Z [rank3]:[W1204 13:35:12.246720297 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0945781Z [rank2]:[W1204 13:35:12.267129309 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0945955Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0946211Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0946378Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0946747Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0946949Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0947052Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0947148Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0947244Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0947247Z 2025-12-04T13:44:26.0947559Z [rank2]:[W1204 13:35:12.269470657 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0947742Z [rank1]:[W1204 13:35:12.858313746 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0947917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0948170Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0948347Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0948716Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0948923Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0949028Z frame #4: + 0xdc253 (0x7e1038e48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0949123Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0949221Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0949223Z 2025-12-04T13:44:26.0949460Z [rank1]:[W1204 13:35:12.860137656 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0949630Z [rank3]:[W1204 13:35:13.246917822 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0949805Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0950058Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0950222Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0950590Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0950791Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0950897Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0950994Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0951089Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.0951093Z 2025-12-04T13:44:26.0951337Z [rank3]:[W1204 13:35:13.248428109 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0951533Z [rank2]:[W1204 13:35:13.269608104 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0951708Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0951963Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0952127Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0952507Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0952712Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0952815Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0952911Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0953006Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0953008Z 2025-12-04T13:44:26.0953243Z [rank2]:[W1204 13:35:13.271577261 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.0953414Z [rank1]:[W1204 13:35:13.860241493 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0953587Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0953842Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0954003Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0954373Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0954575Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0954681Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0954776Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0954873Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0954875Z 2025-12-04T13:44:26.0955108Z [rank1]:[W1204 13:35:13.861498296 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0955298Z [rank3]:[W1204 13:35:14.248604015 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0955483Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0955738Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0955901Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0956268Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0956483Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0956588Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0956682Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0956778Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0956780Z 2025-12-04T13:44:26.0957013Z [rank3]:[W1204 13:35:14.249838608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0957185Z [rank2]:[W1204 13:35:14.271768406 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0957360Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0957648Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0957811Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0958182Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0958388Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0958493Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0958589Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0958684Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0958686Z 2025-12-04T13:44:26.0958919Z [rank2]:[W1204 13:35:14.273720253 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0959092Z [rank1]:[W1204 13:35:14.861617953 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0959293Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0959560Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0959721Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.0960090Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0960307Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0960413Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0960507Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0960602Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0960604Z 2025-12-04T13:44:26.0960837Z [rank1]:[W1204 13:35:14.863522701 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0961006Z [rank3]:[W1204 13:35:15.250015383 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0961180Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0961435Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0961600Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0961967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.0962169Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0962276Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0962370Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0962466Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0962469Z 2025-12-04T13:44:26.0962701Z [rank3]:[W1204 13:35:15.251241846 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0962870Z [rank2]:[W1204 13:35:15.273832920 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0963045Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0963322Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0963497Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0963864Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0964079Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0964183Z frame #4: + 0xdc253 
(0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0964281Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0964376Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0964378Z 2025-12-04T13:44:26.0964611Z [rank2]:[W1204 13:35:15.274966615 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0964781Z [rank1]:[W1204 13:35:15.863714056 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0964955Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0965222Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0965386Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0965753Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0965953Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0966059Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0966156Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0966252Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0966253Z 2025-12-04T13:44:26.0966487Z [rank1]:[W1204 13:35:15.865305222 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0966656Z [rank3]:[W1204 13:35:16.251435143 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0966830Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0967119Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0967293Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0967695Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0967896Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0968033Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0968130Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0968229Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0968230Z 2025-12-04T13:44:26.0968462Z [rank3]:[W1204 13:35:16.253228814 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0968633Z [rank2]:[W1204 13:35:16.275096014 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0968807Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0969067Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0969233Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0969598Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0969801Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0969905Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0970001Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0970098Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0970101Z 2025-12-04T13:44:26.0970335Z [rank2]:[W1204 13:35:16.277452482 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0970507Z [rank1]:[W1204 13:35:16.865455000 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.0970682Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0970940Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0971127Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0971508Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0971710Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0971815Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0971922Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0972017Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0972021Z 2025-12-04T13:44:26.0972257Z [rank1]:[W1204 13:35:16.866705572 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0972426Z [rank3]:[W1204 13:35:17.253430630 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0972602Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0972857Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0973022Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0973389Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0973589Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0973694Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0973791Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0973887Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0973889Z 2025-12-04T13:44:26.0974125Z [rank3]:[W1204 13:35:17.255708640 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0974295Z [rank2]:[W1204 13:35:17.277563310 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0975628Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0975886Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0976051Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0976433Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0976646Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0976764Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0976860Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0976977Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0976979Z 2025-12-04T13:44:26.0977214Z [rank2]:[W1204 13:35:17.278841492 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0977385Z [rank1]:[W1204 13:35:17.866861179 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0977684Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0977939Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0978105Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0978476Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0978680Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0978786Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0978882Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0978980Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0978983Z 2025-12-04T13:44:26.0979217Z [rank1]:[W1204 13:35:17.868094242 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0979390Z [rank3]:[W1204 13:35:18.255843477 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0979563Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0979884Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0980047Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0980429Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0980646Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0980751Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0980850Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0980945Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0980972Z 2025-12-04T13:44:26.0981208Z [rank3]:[W1204 13:35:18.258197885 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0981378Z [rank2]:[W1204 13:35:18.278953140 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0981553Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0981808Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0981971Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0982340Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0982544Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0982650Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0982746Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0982845Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0982848Z 2025-12-04T13:44:26.0983083Z [rank2]:[W1204 13:35:18.281161901 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0983160Z SKIPPED [8.5158s] (Test skipped due to missing import) [ 19%] 2025-12-04T13:44:26.0983426Z distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor I1204 13:35:18.707000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 88193 2025-12-04T13:44:26.0983578Z I1204 13:35:18.707000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 88194 2025-12-04T13:44:26.0983747Z I1204 13:35:18.708000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 88195 2025-12-04T13:44:26.0983894Z I1204 13:35:18.708000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 88196 2025-12-04T13:44:26.0984065Z [rank1]:[W1204 13:35:18.868282248 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0984253Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0984520Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0984684Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0985050Z frame #2: c10d::TCPStore::check(std::vector, 
std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0985267Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0985372Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0985469Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0985564Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0985566Z 2025-12-04T13:44:26.0985800Z [rank1]:[W1204 13:35:18.870118578 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0985976Z [rank3]:[W1204 13:35:19.258402841 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0986155Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0986416Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0986578Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0986946Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0987147Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0987253Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0987351Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0987446Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0987448Z 2025-12-04T13:44:26.0987738Z [rank3]:[W1204 13:35:19.260601423 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0987911Z [rank2]:[W1204 13:35:19.281287489 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0988087Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0988354Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0988530Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0988899Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0989116Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0989222Z frame #4: + 0xdc253 (0x78d42d848253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0989319Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0989416Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0989418Z 2025-12-04T13:44:26.0989655Z [rank2]:[W1204 13:35:19.282861954 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0989827Z [rank1]:[W1204 13:35:19.870312294 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0990002Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0990259Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0990424Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0990788Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0990991Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0991096Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0991193Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0991289Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.0991291Z 2025-12-04T13:44:26.0991525Z [rank1]:[W1204 13:35:19.872133664 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0991711Z [rank3]:[W1204 13:35:20.260769230 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0991886Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0992153Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0992324Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0992690Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0992891Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0993008Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0993106Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0993202Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0993203Z 2025-12-04T13:44:26.0993435Z [rank3]:[W1204 13:35:20.263134248 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.0993604Z [rank2]:[W1204 13:35:20.282978962 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0993780Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0994035Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0994202Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0994572Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0994774Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0994881Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0994976Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0995074Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0995077Z 2025-12-04T13:44:26.0995308Z [rank2]:[W1204 13:35:20.284478679 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0995479Z [rank1]:[W1204 13:35:20.872295041 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0995664Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0995921Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0996096Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0996484Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0996689Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0996793Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0996901Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0996997Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0997000Z 2025-12-04T13:44:26.0997232Z [rank1]:[W1204 13:35:20.873550214 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0997403Z [rank3]:[W1204 13:35:21.263335414 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0997618Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0997873Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0998038Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0998410Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.0998612Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.0998717Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.0998813Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0998908Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.0998909Z 2025-12-04T13:44:26.0999150Z [rank3]:[W1204 13:35:21.265447928 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.0999321Z [rank2]:[W1204 13:35:21.284585958 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.0999498Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.0999767Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.0999933Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.1000314Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1000529Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1000636Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1000731Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1000844Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1000846Z 2025-12-04T13:44:26.1001078Z [rank2]:[W1204 13:35:21.285736183 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1001249Z [rank1]:[W1204 13:35:21.873696381 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1001423Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1001678Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1001843Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1002212Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.1002414Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1002518Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1002616Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1002711Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1002714Z 2025-12-04T13:44:26.1002949Z [rank1]:[W1204 13:35:21.874900655 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1003119Z [rank3]:[W1204 13:35:22.265625735 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1003292Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1003562Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1003726Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1004105Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1004315Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1004422Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1004519Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1004614Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1004628Z 2025-12-04T13:44:26.1004863Z [rank3]:[W1204 13:35:22.266830299 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1005032Z [rank2]:[W1204 13:35:22.285839271 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1005209Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1005466Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1005631Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1006000Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1006202Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1006307Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1006402Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1006499Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1006501Z 2025-12-04T13:44:26.1006736Z [rank2]:[W1204 13:35:22.286995376 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1006908Z [rank1]:[W1204 13:35:22.875080772 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1007084Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1007343Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1007566Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1007947Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1008166Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1008269Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1008366Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1008461Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1008463Z 2025-12-04T13:44:26.1008697Z [rank1]:[W1204 13:35:22.877329173 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1008883Z [rank3]:[W1204 13:35:23.267015346 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1009059Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1009315Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1009479Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1009850Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1010055Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1010160Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1010255Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1010350Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1010352Z 2025-12-04T13:44:26.1010586Z [rank3]:[W1204 13:35:23.268263008 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1010757Z [rank2]:[W1204 13:35:23.287100915 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.1010934Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1011190Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1011353Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1011733Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1011951Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1012068Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1012161Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1012259Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1012263Z 2025-12-04T13:44:26.1012498Z [rank2]:[W1204 13:35:23.288238160 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1012680Z [rank1]:[W1204 13:35:23.877431392 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1012857Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1013111Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1013275Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1013640Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1013844Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1013949Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1014045Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1014140Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1014144Z 2025-12-04T13:44:26.1014377Z [rank1]:[W1204 13:35:23.878577287 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1014547Z [rank3]:[W1204 13:35:24.268451096 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1014723Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1014985Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1015151Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1015532Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1015735Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1015850Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1015963Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1016062Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1016064Z 2025-12-04T13:44:26.1016302Z [rank3]:[W1204 13:35:24.270195507 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1016472Z [rank2]:[W1204 13:35:24.288341769 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1016661Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1016915Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1017080Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1017448Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1017690Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1017795Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1017891Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1017991Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1017993Z 2025-12-04T13:44:26.1018224Z [rank2]:[W1204 13:35:24.289496494 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1018396Z [rank1]:[W1204 13:35:24.878786874 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1018572Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1018829Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1018994Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1019384Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1019587Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1019692Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1019800Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1019909Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1019913Z 2025-12-04T13:44:26.1020144Z [rank1]:[W1204 13:35:24.880163383 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1020319Z [rank3]:[W1204 13:35:25.270369965 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1020492Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1020766Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1020930Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1021300Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1021503Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1021607Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1021704Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1021799Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1021801Z 2025-12-04T13:44:26.1022036Z [rank3]:[W1204 13:35:25.271595228 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1022209Z [rank2]:[W1204 13:35:25.289593763 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1022385Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1022639Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1022802Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1023175Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1023390Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1023497Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1023593Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1023700Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1023715Z 2025-12-04T13:44:26.1023947Z [rank2]:[W1204 13:35:25.290738168 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1024121Z [rank1]:[W1204 13:35:25.880294832 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1024297Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1024552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1024729Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1025097Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1025299Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1025403Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1025501Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1025596Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1025599Z 2025-12-04T13:44:26.1025835Z [rank1]:[W1204 13:35:25.882572522 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1026006Z [rank3]:[W1204 13:35:26.271794795 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1026183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1026439Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1026602Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1026970Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1027172Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1027293Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1027391Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1027522Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1027524Z 2025-12-04T13:44:26.1027777Z [rank3]:[W1204 13:35:26.273292212 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1027960Z [rank2]:[W1204 13:35:26.290886536 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1028134Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1028390Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1028570Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1028938Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1029141Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1029250Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1029348Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1029445Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1029447Z 2025-12-04T13:44:26.1029680Z [rank2]:[W1204 13:35:26.293157196 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1029850Z [rank1]:[W1204 13:35:26.882737370 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1030026Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1030280Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1030445Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1030814Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1031016Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1031120Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1031230Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1031327Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1031330Z 2025-12-04T13:44:26.1031573Z [rank1]:[W1204 13:35:26.884604359 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1031660Z SKIPPED [8.6163s] (Test skipped due to missing import) [ 22%] 2025-12-04T13:44:26.1031943Z distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph I1204 13:35:27.325000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 90817 2025-12-04T13:44:26.1032098Z I1204 13:35:27.325000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 90818 2025-12-04T13:44:26.1032249Z I1204 13:35:27.326000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 90819 2025-12-04T13:44:26.1032397Z I1204 13:35:27.326000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 90820 2025-12-04T13:44:26.1032580Z [rank3]:[W1204 
13:35:27.273412232 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1032756Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1033010Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1033173Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1033540Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1033742Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1033851Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1033946Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1034045Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1034047Z 2025-12-04T13:44:26.1034284Z [rank3]:[W1204 13:35:27.275765940 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1034456Z [rank2]:[W1204 13:35:27.293258356 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1034632Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1034886Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1035050Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1035430Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1035645Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1035761Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1035857Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1035954Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1035956Z 2025-12-04T13:44:26.1036192Z [rank2]:[W1204 13:35:27.295327320 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1036376Z [rank1]:[W1204 13:35:27.884767048 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1036550Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1036807Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1036969Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1037335Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1037579Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1037688Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1037783Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1037883Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1037885Z 2025-12-04T13:44:26.1038120Z [rank1]:[W1204 13:35:27.886842652 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1038291Z [rank3]:[W1204 13:35:28.275969958 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1038465Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1038724Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1038889Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1039272Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1039474Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1039596Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1039706Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1039802Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1039804Z 2025-12-04T13:44:26.1040039Z [rank3]:[W1204 13:35:28.277658190 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1040210Z [rank2]:[W1204 13:35:28.295439340 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1040406Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1040663Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1040828Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1041196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1041397Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1041504Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1041600Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1041698Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1041700Z 2025-12-04T13:44:26.1041936Z [rank2]:[W1204 13:35:28.297390127 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1042107Z [rank1]:[W1204 13:35:28.887036760 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1042280Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1042537Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1042701Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1043081Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1043287Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1043395Z frame #4: + 0xdc253 (0x7e1038e48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1043502Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1043611Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1043613Z 2025-12-04T13:44:26.1043844Z [rank1]:[W1204 13:35:28.889464377 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1044017Z [rank3]:[W1204 13:35:29.277771990 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1044190Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1044459Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1044621Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1044991Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1045194Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1045298Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1045397Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1045493Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.1045497Z 2025-12-04T13:44:26.1045729Z [rank3]:[W1204 13:35:29.279166800 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1045900Z [rank2]:[W1204 13:35:29.297462798 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1046077Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1046332Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1046496Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1046864Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1047076Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1047182Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1047279Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1047387Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1047400Z 2025-12-04T13:44:26.1047678Z [rank2]:[W1204 13:35:29.298580943 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.1047848Z [rank1]:[W1204 13:35:29.889651265 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1048024Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1048279Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1048458Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1048824Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1049026Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1049130Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1049226Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1049322Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1049324Z 2025-12-04T13:44:26.1049562Z [rank1]:[W1204 13:35:29.891126312 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1049735Z [rank3]:[W1204 13:35:30.279339828 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1049910Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1050172Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1050335Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1050701Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1050904Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1051021Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1051118Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1051215Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1051217Z 2025-12-04T13:44:26.1051463Z [rank3]:[W1204 13:35:30.280559851 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1051645Z [rank2]:[W1204 13:35:30.298679494 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1051821Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1052080Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1052253Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1052619Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1052821Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1052928Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1053023Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1053122Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1053124Z 2025-12-04T13:44:26.1053359Z [rank2]:[W1204 13:35:30.299878487 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1053530Z [rank1]:[W1204 13:35:30.891258322 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1053707Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1053965Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1054129Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.1054497Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1054700Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1054806Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1054910Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1055008Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1055011Z 2025-12-04T13:44:26.1055253Z [rank1]:[W1204 13:35:30.892457316 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1055441Z [rank3]:[W1204 13:35:31.280755670 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1055617Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1055874Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1056037Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1056414Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.1056617Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1056720Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1056818Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1056913Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1056916Z 2025-12-04T13:44:26.1057153Z [rank3]:[W1204 13:35:31.282228567 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1057323Z [rank2]:[W1204 13:35:31.299992337 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1057552Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1057808Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1057973Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1058348Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1058552Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1058661Z frame #4: + 0xdc253 
(0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1058759Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1058874Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1058876Z 2025-12-04T13:44:26.1059112Z [rank2]:[W1204 13:35:31.301143472 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1059299Z [rank1]:[W1204 13:35:31.892609465 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1059490Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1059748Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1059913Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1060287Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1060503Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1060610Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1060705Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1060808Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1060811Z 2025-12-04T13:44:26.1061042Z [rank1]:[W1204 13:35:31.893800769 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1061217Z [rank3]:[W1204 13:35:32.282350417 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1061395Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1061652Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1061817Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1062187Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1062389Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1062493Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1062590Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1062686Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1062687Z 2025-12-04T13:44:26.1062936Z [rank3]:[W1204 13:35:32.284266815 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1063108Z [rank2]:[W1204 13:35:32.301231633 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1063293Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1063563Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1063731Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1064098Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1064315Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1064422Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1064517Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1064614Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1064615Z 2025-12-04T13:44:26.1064853Z [rank2]:[W1204 13:35:32.302352358 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1065027Z [rank1]:[W1204 13:35:32.893984978 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.1065203Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1065463Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1065630Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1066003Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1066206Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1066312Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1066408Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1066504Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1066506Z 2025-12-04T13:44:26.1066756Z [rank1]:[W1204 13:35:32.895954254 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1066934Z [rank3]:[W1204 13:35:33.284445614 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1067112Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1067386Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1067597Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1067976Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1068192Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1068296Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1068395Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1068490Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1068492Z 2025-12-04T13:44:26.1068730Z [rank3]:[W1204 13:35:33.286100898 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1068904Z [rank2]:[W1204 13:35:33.302459179 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1069080Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1069340Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1069510Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1069884Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1070086Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1070193Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1070290Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1070388Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1070390Z 2025-12-04T13:44:26.1070623Z [rank2]:[W1204 13:35:33.303595974 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1070808Z [rank1]:[W1204 13:35:33.896087744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1070984Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1071260Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1071438Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1071812Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1072015Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1072132Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1072228Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1072326Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1072328Z 2025-12-04T13:44:26.1072569Z [rank1]:[W1204 13:35:33.897987513 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1072744Z [rank3]:[W1204 13:35:34.286211889 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1072917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1073185Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1073349Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1073720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1073925Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1074032Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1074129Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1074227Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1074230Z 2025-12-04T13:44:26.1074465Z [rank3]:[W1204 13:35:34.288342152 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1074638Z [rank2]:[W1204 13:35:34.303699425 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1074824Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1075087Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1075263Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1075645Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1075846Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1075952Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1076059Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1076159Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1076162Z 2025-12-04T13:44:26.1076399Z [rank2]:[W1204 13:35:34.305272780 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1076573Z [rank1]:[W1204 13:35:34.898163172 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1076753Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1077013Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1077178Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1077573Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1077778Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1077886Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1077981Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1078080Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1078082Z 2025-12-04T13:44:26.1078319Z [rank1]:[W1204 13:35:34.900361004 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1078496Z [rank3]:[W1204 13:35:35.288459952 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1078669Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1078946Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1079112Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1079492Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1079708Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1079813Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1079909Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1080022Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1080024Z 2025-12-04T13:44:26.1080263Z [rank3]:[W1204 13:35:35.290545337 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1080440Z [rank2]:[W1204 13:35:35.305353942 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1080615Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1080871Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1081038Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1081409Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1081610Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1081718Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1081819Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1081919Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1081920Z 2025-12-04T13:44:26.1082160Z [rank2]:[W1204 13:35:35.306481287 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1082237Z SKIPPED [8.4157s] (Test skipped due to missing import) [ 25%] 2025-12-04T13:44:26.1082419Z distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp SKIPPED [0.0002s] (Inaccurate results with fused SDPA kernels) [ 27%] 2025-12-04T13:44:26.1082724Z 
distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing I1204 13:35:35.743000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 93441 2025-12-04T13:44:26.1082879Z I1204 13:35:35.743000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 93442 2025-12-04T13:44:26.1083029Z I1204 13:35:35.744000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 93443 2025-12-04T13:44:26.1083194Z I1204 13:35:35.745000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 93444 2025-12-04T13:44:26.1083374Z [rank1]:[W1204 13:35:35.900528973 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1083552Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1083815Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1083978Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1084368Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1084578Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1084685Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1084783Z frame #5: + 
0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1084883Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1084886Z 2025-12-04T13:44:26.1085127Z [rank1]:[W1204 13:35:35.901770656 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1085299Z [rank3]:[W1204 13:35:36.290729886 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1085478Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1085732Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1085905Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1086271Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1086478Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1086584Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1086682Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1086798Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1086800Z 2025-12-04T13:44:26.1087037Z [rank3]:[W1204 13:35:36.292711002 
ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1087219Z [rank2]:[W1204 13:35:36.306611707 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1087412Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1087701Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1087874Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1088245Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1088463Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1088570Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1088667Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1088768Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1088772Z 2025-12-04T13:44:26.1089005Z [rank2]:[W1204 13:35:36.308063535 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1089183Z [rank1]:[W1204 13:35:36.901928376 
TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1089361Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1089617Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1089784Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1090155Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1090357Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1090464Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1090558Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1090658Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1090660Z 2025-12-04T13:44:26.1090908Z [rank1]:[W1204 13:35:36.903195768 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1091079Z [rank3]:[W1204 13:35:37.292868563 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1091270Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1091540Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1091705Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1092072Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1092288Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1092393Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1092488Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1092584Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1092586Z 2025-12-04T13:44:26.1092824Z [rank3]:[W1204 13:35:37.294677913 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1093002Z [rank2]:[W1204 13:35:37.308221585 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1093176Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1093437Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1093600Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1093968Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1094171Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1094275Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1094373Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1094469Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1094471Z 2025-12-04T13:44:26.1094717Z [rank2]:[W1204 13:35:37.310384548 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1094889Z [rank1]:[W1204 13:35:37.903351899 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1095064Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1095339Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1095514Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1095884Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1096095Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1096202Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1096299Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1096394Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1096396Z 2025-12-04T13:44:26.1096634Z [rank1]:[W1204 13:35:37.904586511 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1096806Z [rank3]:[W1204 13:35:38.294854933 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1096981Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1097235Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1097400Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1097812Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1098022Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1098128Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1098224Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1098324Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1098326Z 2025-12-04T13:44:26.1098561Z [rank3]:[W1204 13:35:38.296059486 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1098743Z [rank2]:[W1204 13:35:38.310496209 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1098918Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1099187Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1099364Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1104308Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1104521Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1104664Z frame #4: + 0xdc253 (0x78d42d848253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1104764Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1104862Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1104865Z 2025-12-04T13:44:26.1105099Z [rank2]:[W1204 13:35:38.311655134 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1105270Z [rank1]:[W1204 13:35:38.904732312 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1105446Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1105712Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1105877Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1106246Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1106449Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1106556Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1106653Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1106749Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.1106754Z 2025-12-04T13:44:26.1106989Z [rank1]:[W1204 13:35:38.906135541 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1107165Z [rank3]:[W1204 13:35:39.296223547 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1107355Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1107652Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1107836Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1108216Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1108422Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1108528Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1108637Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1108735Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1108737Z 2025-12-04T13:44:26.1108970Z [rank3]:[W1204 13:35:39.297477359 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.1109141Z [rank2]:[W1204 13:35:39.311752396 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1109316Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1109578Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1109745Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1110115Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1110321Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1110426Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1110525Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1110621Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1110624Z 2025-12-04T13:44:26.1110859Z [rank2]:[W1204 13:35:39.312906820 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1111030Z [rank1]:[W1204 13:35:39.906288172 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1111219Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1111473Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1111642Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1112018Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1112229Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1112337Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1112432Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1112539Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1112541Z 2025-12-04T13:44:26.1112780Z [rank1]:[W1204 13:35:39.907656232 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1112951Z [rank3]:[W1204 13:35:40.297659959 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1113127Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1113382Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1113547Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1113921Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1114123Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1114229Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1114323Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1114421Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1114423Z 2025-12-04T13:44:26.1114655Z [rank3]:[W1204 13:35:40.299510939 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1114829Z [rank2]:[W1204 13:35:40.313010692 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1115006Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1115272Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1115438Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.1115814Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1116030Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1116133Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1116230Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1116326Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1116338Z 2025-12-04T13:44:26.1116579Z [rank2]:[W1204 13:35:40.314153157 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1116751Z [rank1]:[W1204 13:35:40.909260161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1116925Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1117188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1117350Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1117754Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.1117955Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1118062Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1118160Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1118260Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1118263Z 2025-12-04T13:44:26.1118501Z [rank1]:[W1204 13:35:40.910755998 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1118671Z [rank3]:[W1204 13:35:41.299690299 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1118846Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1119102Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1119280Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1119666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1119891Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1119995Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1120089Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1120188Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1120190Z 2025-12-04T13:44:26.1120421Z [rank3]:[W1204 13:35:41.301245595 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1120609Z [rank2]:[W1204 13:35:41.314241639 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1120785Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1121041Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1121206Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1121575Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1121781Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1121886Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1121982Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1122078Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1122080Z 2025-12-04T13:44:26.1122315Z [rank2]:[W1204 13:35:41.315390864 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1122488Z [rank1]:[W1204 13:35:41.910947698 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1122664Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1122919Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1123092Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1123459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1123673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1123788Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1123885Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1123981Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1123983Z 2025-12-04T13:44:26.1124219Z [rank1]:[W1204 13:35:41.912837567 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1124399Z [rank3]:[W1204 13:35:42.301434955 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1124574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1124828Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1124991Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1125360Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1125562Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1125668Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1125763Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1125860Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1125862Z 2025-12-04T13:44:26.1126096Z [rank3]:[W1204 13:35:42.303317774 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1126268Z [rank2]:[W1204 13:35:42.315494116 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.1126446Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1126702Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1126865Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1127239Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1127452Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1127605Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1127702Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1127798Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1127800Z 2025-12-04T13:44:26.1128034Z [rank2]:[W1204 13:35:42.316644341 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1128206Z [rank1]:[W1204 13:35:42.913006548 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1128395Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1128654Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1128816Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1129182Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1129384Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1129491Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1129590Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1129690Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1129691Z 2025-12-04T13:44:26.1129927Z [rank1]:[W1204 13:35:42.914828938 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1130097Z [rank3]:[W1204 13:35:43.303409976 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1130272Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1130527Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1130691Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1131070Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1131272Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1131378Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1131489Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1131597Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1131599Z 2025-12-04T13:44:26.1131831Z [rank3]:[W1204 13:35:43.304554841 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1132001Z [rank2]:[W1204 13:35:43.316724584 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1132175Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1132442Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1132606Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1132975Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1133186Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1133291Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1133389Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1133486Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1133489Z 2025-12-04T13:44:26.1133722Z [rank2]:[W1204 13:35:43.317834000 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1133893Z [rank1]:[W1204 13:35:43.914999339 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1134067Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1134323Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1134485Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1134852Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1135066Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1135173Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1135270Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1135381Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1135393Z 2025-12-04T13:44:26.1135628Z [rank1]:[W1204 13:35:43.916826309 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1135703Z SKIPPED [8.3153s] (Test skipped due to missing import) [ 30%] 2025-12-04T13:44:26.1135972Z distributed/test_dynamo_distributed.py::TestMultiProc::test_multiproc_autotune I1204 13:35:44.060000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 96065 2025-12-04T13:44:26.1136124Z I1204 13:35:44.061000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 96066 2025-12-04T13:44:26.1136286Z I1204 13:35:44.061000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 96067 2025-12-04T13:44:26.1136434Z I1204 13:35:44.062000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 96068 2025-12-04T13:44:26.1136611Z [rank3]:[W1204 13:35:44.304647674 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1136787Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1137042Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1137207Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1137607Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1137811Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1137916Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1138012Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1138111Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1138114Z 2025-12-04T13:44:26.1138347Z [rank3]:[W1204 13:35:44.305791819 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1138521Z [rank2]:[W1204 13:35:44.317912602 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1138696Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1138969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1139133Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1139513Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1139729Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1139833Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1139936Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1140035Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1140050Z 2025-12-04T13:44:26.1140284Z [rank2]:[W1204 13:35:44.319028558 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1140455Z [rank1]:[W1204 13:35:44.916979570 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1140631Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1140884Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1141047Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1141416Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1141620Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1141725Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1141819Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1141916Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1141920Z 2025-12-04T13:44:26.1142152Z [rank1]:[W1204 13:35:44.918246422 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1142322Z [rank3]:[W1204 13:35:45.305938190 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1142498Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1142752Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1142924Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1143302Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1143516Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1143620Z frame #4: + 0xdc253 (0x719d06c48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1143716Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1143816Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1143818Z 2025-12-04T13:44:26.1144052Z [rank3]:[W1204 13:35:45.307810819 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1144233Z [rank2]:[W1204 13:35:45.319136621 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1144410Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1144665Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1144829Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1145195Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1145399Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1145503Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1145600Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1145695Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.1145697Z 2025-12-04T13:44:26.1145930Z [rank2]:[W1204 13:35:45.320268406 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1146101Z [rank1]:[W1204 13:35:45.918402554 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1146277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1146538Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1146699Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1147080Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1147293Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1147408Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1147538Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1147635Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1147637Z 2025-12-04T13:44:26.1147871Z [rank1]:[W1204 13:35:45.919613597 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.1148054Z [rank3]:[W1204 13:35:46.307920112 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1148231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1148486Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1148648Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1149014Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1149216Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1149323Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1149417Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1149513Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1149515Z 2025-12-04T13:44:26.1149751Z [rank3]:[W1204 13:35:46.309210384 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1149921Z [rank2]:[W1204 13:35:46.320363069 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1150097Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1150353Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1150521Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1150911Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1151115Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1151230Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1151337Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1151433Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1151435Z 2025-12-04T13:44:26.1151668Z [rank2]:[W1204 13:35:46.321734869 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1151838Z [rank1]:[W1204 13:35:46.919760359 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1152023Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1152280Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1152443Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1152809Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1153014Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1153120Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1153215Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1153312Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1153314Z 2025-12-04T13:44:26.1153547Z [rank1]:[W1204 13:35:46.921024591 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1153716Z [rank3]:[W1204 13:35:47.309374385 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1153892Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1154152Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1154318Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.1154691Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1154899Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1155005Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1155109Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1155216Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1155218Z 2025-12-04T13:44:26.1155451Z [rank3]:[W1204 13:35:47.310617488 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1155622Z [rank2]:[W1204 13:35:47.321867871 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1155797Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1156068Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1156231Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1156604Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.1156806Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1156911Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1157009Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1157105Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1157107Z 2025-12-04T13:44:26.1157339Z [rank2]:[W1204 13:35:47.323973295 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1157549Z [rank1]:[W1204 13:35:47.921181393 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1157725Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1157981Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1158143Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1158518Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1158731Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1158839Z frame #4: + 0xdc253 
(0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1158934Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1159043Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1159057Z 2025-12-04T13:44:26.1159292Z [rank1]:[W1204 13:35:47.923036303 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1159462Z [rank3]:[W1204 13:35:48.310782780 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1159639Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1159892Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1160070Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1160440Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1160645Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1160751Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1160847Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1160945Z frame #6: + 0x1268c0 (0x719e15f168c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1160948Z 2025-12-04T13:44:26.1161182Z [rank3]:[W1204 13:35:48.312029892 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1161352Z [rank2]:[W1204 13:35:48.324097628 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1161527Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1161783Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1161947Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1162315Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1162519Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1162634Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1162730Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1162829Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1162831Z 2025-12-04T13:44:26.1163075Z [rank2]:[W1204 13:35:48.325771821 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1163255Z [rank1]:[W1204 13:35:48.923178325 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1163431Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1163685Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1163864Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1164238Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1164443Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1164548Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1164642Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1164741Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1164742Z 2025-12-04T13:44:26.1164975Z [rank1]:[W1204 13:35:48.925336078 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1165144Z [rank3]:[W1204 13:35:49.312191845 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.1165318Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1165572Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1165736Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1166103Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1166309Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1166413Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1166518Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1166615Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1166618Z 2025-12-04T13:44:26.1166863Z [rank3]:[W1204 13:35:49.313535985 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1167044Z [rank2]:[W1204 13:35:49.325888824 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1167219Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1167500Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1167664Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1168052Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1168255Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1168359Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1168456Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1168552Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1168555Z 2025-12-04T13:44:26.1168789Z [rank2]:[W1204 13:35:49.327153676 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1168961Z [rank1]:[W1204 13:35:49.925450571 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1169136Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1169392Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1169553Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1169921Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1170123Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1170227Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1170323Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1170432Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1170434Z 2025-12-04T13:44:26.1170670Z [rank1]:[W1204 13:35:49.927398158 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1170853Z [rank3]:[W1204 13:35:50.313724287 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1171046Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1171300Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1171464Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1171833Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1172050Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1172156Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1172252Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1172350Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1172351Z 2025-12-04T13:44:26.1172584Z [rank3]:[W1204 13:35:50.315116606 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1172756Z [rank2]:[W1204 13:35:50.327255700 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1172930Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1173184Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1173349Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1173715Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1173918Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1174023Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1174120Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1174216Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1174218Z 2025-12-04T13:44:26.1174469Z [rank2]:[W1204 13:35:50.328409394 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1174643Z [rank1]:[W1204 13:35:50.927571040 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1174827Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1175093Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1175255Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1175623Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1175834Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1175939Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1176036Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1176134Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1176136Z 2025-12-04T13:44:26.1176375Z [rank1]:[W1204 13:35:50.929841480 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1176545Z [rank3]:[W1204 13:35:51.315214860 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1176727Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1176986Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1177148Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1177568Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1177771Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1177877Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1177972Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1178070Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1178072Z 2025-12-04T13:44:26.1178317Z [rank3]:[W1204 13:35:51.316809725 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1178489Z [rank2]:[W1204 13:35:51.328491688 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1178666Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1178939Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1179122Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1179488Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1179703Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1179808Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1179905Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1180001Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1180003Z 2025-12-04T13:44:26.1180237Z [rank2]:[W1204 13:35:51.329598414 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1180407Z [rank1]:[W1204 13:35:51.929955934 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1180582Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1180840Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1181002Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1181368Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1181569Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1181674Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1181772Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1181868Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1181870Z 2025-12-04T13:44:26.1182104Z [rank1]:[W1204 13:35:51.932067598 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1182284Z [rank3]:[W1204 13:35:52.316989107 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1182462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1182726Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1182901Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1183268Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1183469Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1183588Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1183685Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1183783Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1183784Z 2025-12-04T13:44:26.1184016Z [rank3]:[W1204 13:35:52.318261019 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1184187Z [rank2]:[W1204 13:35:52.329667799 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1184362Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1184619Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1184785Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1185157Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1185362Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1185469Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1185569Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1185666Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1185669Z 2025-12-04T13:44:26.1185902Z [rank2]:[W1204 13:35:52.332260692 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1186073Z [rank1]:[W1204 13:35:52.932205911 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1186258Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1186516Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1186689Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1187066Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1187269Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1187384Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1187524Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1187620Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1187623Z 2025-12-04T13:44:26.1187857Z [rank1]:[W1204 13:35:52.934614028 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1188027Z [rank3]:[W1204 13:35:53.318438451 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1188203Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1188462Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1188628Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1188996Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1189198Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1189304Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1189399Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1189495Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1189498Z 2025-12-04T13:44:26.1189735Z [rank3]:[W1204 13:35:53.320492526 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1189909Z [rank2]:[W1204 13:35:53.332357996 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1190097Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1190355Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1190532Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1190911Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1191113Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1191218Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1191316Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1191426Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1191430Z 2025-12-04T13:44:26.1191664Z [rank2]:[W1204 13:35:53.334384651 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1191835Z [rank1]:[W1204 13:35:53.934768561 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1192010Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1192266Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1192428Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1192798Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1193000Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1193105Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1193201Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1193299Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1193301Z 2025-12-04T13:44:26.1193535Z [rank1]:[W1204 13:35:53.937036271 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1193705Z [rank3]:[W1204 13:35:54.320650319 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1193880Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1194149Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1194316Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1194697Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1194908Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1195013Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1195108Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1195205Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1195218Z 2025-12-04T13:44:26.1195453Z [rank3]:[W1204 13:35:54.322811922 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1195624Z [rank2]:[W1204 13:35:54.334464616 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1195800Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1196058Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1196227Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1196595Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1196796Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1196899Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1196996Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1197092Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1197096Z 2025-12-04T13:44:26.1197329Z [rank2]:[W1204 13:35:54.336693767 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1197548Z [rank1]:[W1204 13:35:54.937174994 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1197724Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1197980Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1198161Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1198545Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1198760Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1198864Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1198960Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1199057Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1199059Z 2025-12-04T13:44:26.1199298Z [rank1]:[W1204 13:35:55.939483994 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1199482Z [rank3]:[W1204 13:35:55.322955135 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1199659Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1199914Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1200078Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1200446Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1200648Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1200752Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1200846Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1200943Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1200945Z 2025-12-04T13:44:26.1201178Z [rank3]:[W1204 13:35:55.324144979 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1201351Z [rank2]:[W1204 13:35:55.336809501 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1201527Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1201780Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1201955Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1202322Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1202540Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1202655Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1202751Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1202848Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1202853Z 2025-12-04T13:44:26.1203086Z [rank2]:[W1204 13:35:55.338660811 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1203269Z [rank1]:[W1204 13:35:56.939655847 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1203445Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1203703Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1203865Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1204233Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1204435Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1204540Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1204636Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1204731Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1204733Z 2025-12-04T13:44:26.1204968Z [rank1]:[W1204 13:35:56.941818659 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1205138Z [rank3]:[W1204 13:35:56.324342992 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1205316Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1205573Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1205738Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1206116Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1206327Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1206441Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1206536Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1206633Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1206635Z 2025-12-04T13:44:26.1206869Z [rank3]:[W1204 13:35:56.326027015 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1207039Z [rank2]:[W1204 13:35:56.338840033 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1207226Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1207541Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1207710Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1208081Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1208285Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1208389Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1208486Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1208584Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1208586Z 2025-12-04T13:44:26.1208819Z [rank2]:[W1204 13:35:56.341111514 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1208992Z [rank1]:[W1204 13:35:57.941996882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1209166Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1209421Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1209584Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1209976Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1210178Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1210296Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1210403Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1210498Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1210500Z 2025-12-04T13:44:26.1210732Z [rank1]:[W1204 13:35:57.944027398 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1210902Z [rank3]:[W1204 13:35:57.326218527 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1211078Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1211350Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1211514Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1211886Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1212089Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1212195Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1212291Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1212389Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1212391Z 2025-12-04T13:44:26.1212622Z [rank3]:[W1204 13:35:57.328077947 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1212795Z [rank2]:[W1204 13:35:57.341221518 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1212971Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1213226Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1213390Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1213756Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1213970Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1214076Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1214174Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1214282Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1214294Z 2025-12-04T13:44:26.1214527Z [rank2]:[W1204 13:35:57.343155696 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1214698Z [rank1]:[W1204 13:35:58.944162312 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1214872Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1215139Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1215303Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1215670Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1215872Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1215976Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1216073Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1216169Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1216171Z 2025-12-04T13:44:26.1216406Z [rank1]:[W1204 13:35:58.946407102 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1216576Z [rank3]:[W1204 13:35:58.329144140 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1216752Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1217006Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1217169Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1217576Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1217790Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1217895Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1217991Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1218088Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1218104Z 2025-12-04T13:44:26.1218349Z [rank3]:[W1204 13:35:58.330607328 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1218522Z [rank2]:[W1204 13:35:58.343241101 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1218699Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1218954Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1219133Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1219499Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1219701Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1219808Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1219905Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1220003Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1220005Z 2025-12-04T13:44:26.1220237Z [rank2]:[W1204 13:35:58.345570590 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1220410Z [rank1]:[W1204 13:35:59.946578666 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1220584Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1220842Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1221007Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1221375Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1221578Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1221692Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1221788Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1221885Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1221887Z 2025-12-04T13:44:26.1222129Z [rank1]:[W1204 13:35:59.949096651 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1222309Z [rank3]:[W1204 13:35:59.330721723 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1222485Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1222744Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1222925Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1223293Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1223493Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1223598Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1223692Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1223789Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1223792Z 2025-12-04T13:44:26.1224026Z [rank3]:[W1204 13:35:59.332269449 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1224200Z [rank2]:[W1204 13:35:59.345647495 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1224375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1224631Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1224796Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1225171Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1225373Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1225478Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1225582Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1225681Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1225684Z 2025-12-04T13:44:26.1225916Z [rank2]:[W1204 13:35:59.347331858 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1226099Z [rank1]:[W1204 13:36:00.949279944 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1226283Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1226540Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1226702Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1227085Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1227290Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1227394Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1227524Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1227620Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1227622Z 2025-12-04T13:44:26.1227857Z [rank1]:[W1204 13:36:00.951642082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1228030Z [rank3]:[W1204 13:36:00.332419733 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1228206Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1228465Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1228629Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1228996Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1229198Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1229304Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1229399Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1229499Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1229517Z 2025-12-04T13:44:26.1229753Z [rank3]:[W1204 13:36:00.334020808 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1229936Z [rank2]:[W1204 13:36:00.347434243 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1230124Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1230378Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1230543Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1230910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1231132Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1231238Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1231335Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1231433Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1231435Z 2025-12-04T13:44:26.1231670Z [rank2]:[W1204 13:36:00.349425270 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1231841Z [rank1]:[W1204 13:36:01.951796816 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1232015Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1232271Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1232434Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1232802Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1233008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1233112Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1233208Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1233305Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1233307Z 2025-12-04T13:44:26.1233552Z [rank1]:[W1204 13:36:01.953553787 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1233725Z [rank3]:[W1204 13:36:01.334175782 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1233911Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1234179Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1234341Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1234708Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1234920Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1235027Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1235122Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1235219Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1235221Z 2025-12-04T13:44:26.1235456Z [rank3]:[W1204 13:36:01.335417155 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1235626Z [rank2]:[W1204 13:36:01.349560184 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1235803Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1236058Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1236224Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1236590Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1236793Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1236900Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1236996Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1237094Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1237096Z 2025-12-04T13:44:26.1237330Z [rank2]:[W1204 13:36:01.351868664 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1237551Z [rank1]:[W1204 13:36:02.953707672 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1237727Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1237998Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1238185Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1238551Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1238753Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1238870Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1238968Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1239066Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1239068Z 2025-12-04T13:44:26.1239304Z [rank1]:[W1204 13:36:02.954963244 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1239475Z [rank3]:[W1204 13:36:02.335622438 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1239651Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1239908Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1240071Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1240438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1240638Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1240744Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1240840Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1240938Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1240940Z 2025-12-04T13:44:26.1241173Z [rank3]:[W1204 13:36:02.338015755 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1241354Z [rank2]:[W1204 13:36:02.351984009 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1241532Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1241803Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1241979Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1242347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1242550Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1242668Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1242763Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1242863Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1242866Z 2025-12-04T13:44:26.1243097Z [rank2]:[W1204 13:36:02.353869477 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1243268Z [rank1]:[W1204 13:36:03.955068450 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1243442Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1243701Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1243864Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1244233Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1244435Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1244539Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1244637Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1244733Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1244736Z 2025-12-04T13:44:26.1244970Z [rank1]:[W1204 13:36:03.956298453 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1245138Z [rank3]:[W1204 13:36:03.338414125 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1245326Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1245583Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1245757Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1246136Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1246340Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1246445Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1246551Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1246649Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1246651Z 2025-12-04T13:44:26.1246885Z [rank3]:[W1204 13:36:03.340276484 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1247056Z [rank2]:[W1204 13:36:03.354008662 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1247234Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1247528Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1247696Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1248064Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1248268Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1248374Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1248469Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1248569Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1248571Z 2025-12-04T13:44:26.1248803Z [rank2]:[W1204 13:36:03.355622227 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1248975Z [rank1]:[W1204 13:36:04.956411598 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1249147Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1249419Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1249585Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1249962Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1250177Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1250283Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1250379Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1250488Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1250490Z 2025-12-04T13:44:26.1250723Z [rank1]:[W1204 13:36:04.957662031 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1250894Z [rank3]:[W1204 13:36:04.340438568 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1251070Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1251326Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1251488Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1251858Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1252060Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1252166Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1252262Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1252362Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1252365Z 2025-12-04T13:44:26.1252604Z [rank3]:[W1204 13:36:04.341793099 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1252775Z [rank2]:[W1204 13:36:04.355735412 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1252950Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1253213Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1253378Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1253761Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1253974Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1254080Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1254176Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1254277Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1254290Z 2025-12-04T13:44:26.1254528Z [rank2]:[W1204 13:36:04.357574392 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1254700Z [rank1]:[W1204 13:36:05.957779167 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1254875Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1255130Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1255293Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1255659Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1255864Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1255969Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1256065Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1256161Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1256163Z 2025-12-04T13:44:26.1256397Z [rank1]:[W1204 13:36:05.959079128 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1256569Z [rank3]:[W1204 13:36:05.341984422 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1256744Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1257000Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1257173Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1257593Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1257808Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1257913Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1258008Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1258106Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1258108Z 2025-12-04T13:44:26.1258343Z [rank3]:[W1204 13:36:05.344107606 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1258526Z [rank2]:[W1204 13:36:05.357683968 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1258703Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1258958Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1259124Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1259489Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1259695Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1259801Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1259896Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1259994Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1259996Z 2025-12-04T13:44:26.1260230Z [rank2]:[W1204 13:36:05.358857532 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1260403Z [rank1]:[W1204 13:36:06.959252933 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1260578Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1260836Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1261002Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1261381Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1261596Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1261710Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1261806Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1261901Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1261903Z 2025-12-04T13:44:26.1262137Z [rank1]:[W1204 13:36:06.960666431 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1262308Z [rank3]:[W1204 13:36:06.344276921 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1262495Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1262753Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1262915Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1263283Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1263485Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1263591Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1263689Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1263785Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1263789Z 2025-12-04T13:44:26.1264022Z [rank3]:[W1204 13:36:06.346161799 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1264190Z [rank2]:[W1204 13:36:06.358979838 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1264366Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1264623Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1264791Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1265166Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1265369Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1265485Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1265593Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1265691Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1265693Z 2025-12-04T13:44:26.1265924Z [rank2]:[W1204 13:36:06.360245450 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1266095Z [rank1]:[W1204 13:36:07.960850636 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1266279Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1266536Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1266700Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1267070Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1267272Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1267376Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1267504Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1267601Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1267603Z 2025-12-04T13:44:26.1267836Z [rank1]:[W1204 13:36:07.962938690 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1268007Z [rank3]:[W1204 13:36:07.346349114 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1268182Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1268440Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1268602Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1268989Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1269192Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1269299Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1269409Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1269519Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1269521Z 2025-12-04T13:44:26.1269755Z [rank3]:[W1204 13:36:07.348778660 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1269925Z [rank2]:[W1204 13:36:07.360374546 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1270103Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1270374Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1270539Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1270905Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1271107Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1271213Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1271309Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1271406Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1271409Z 2025-12-04T13:44:26.1271640Z [rank2]:[W1204 13:36:07.362512159 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1271810Z [rank1]:[W1204 13:36:08.963053556 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1271984Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1272242Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1272407Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1272777Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1272989Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1273093Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1273190Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1273295Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1273307Z 2025-12-04T13:44:26.1273541Z [rank1]:[W1204 13:36:08.964357608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1273713Z [rank3]:[W1204 13:36:08.348942345 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1273888Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1274145Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1274318Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1274690Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1274893Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1274999Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1275097Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1275192Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1275194Z 2025-12-04T13:44:26.1275428Z [rank3]:[W1204 13:36:08.350874233 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1275598Z [rank2]:[W1204 13:36:08.362632425 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1275773Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1276028Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1276193Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1276561Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1276764Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1276882Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1276977Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1277075Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1277077Z 2025-12-04T13:44:26.1277317Z [rank2]:[W1204 13:36:08.364945424 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1277525Z [rank1]:[W1204 13:36:09.964519423 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1277700Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1277955Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1278135Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1278500Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1278702Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1278807Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1278904Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1279001Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1279003Z 2025-12-04T13:44:26.1279243Z [rank1]:[W1204 13:36:09.965829564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1279415Z [rank3]:[W1204 13:36:09.351081817 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1279588Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1279844Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1280006Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1280372Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1280573Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1280679Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1280788Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1280884Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1280886Z 2025-12-04T13:44:26.1281135Z [rank3]:[W1204 13:36:09.353061904 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1281319Z [rank2]:[W1204 13:36:09.365047931 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1281494Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1281749Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1281912Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1282297Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1282499Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1282604Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1282698Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1282795Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1282797Z 2025-12-04T13:44:26.1283030Z [rank2]:[W1204 13:36:09.366196095 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1283203Z [rank1]:[W1204 13:36:10.966005519 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1283378Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1283635Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1283800Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1284169Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1284374Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1284478Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1284576Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1284682Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1284684Z 2025-12-04T13:44:26.1284918Z [rank1]:[W1204 13:36:10.967938627 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1285100Z [rank3]:[W1204 13:36:10.353246269 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1285286Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1285546Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1285709Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1286076Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1286292Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1286397Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1286493Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1286588Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1286590Z 2025-12-04T13:44:26.1286822Z [rank3]:[W1204 13:36:10.355113008 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1286992Z [rank2]:[W1204 13:36:10.366415280 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1287169Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1287429Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1287631Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1288002Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1288206Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1288311Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1288406Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1288503Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1288505Z 2025-12-04T13:44:26.1288752Z [rank2]:[W1204 13:36:10.368293319 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1288926Z [rank1]:[W1204 13:36:11.968100052 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1289115Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1289384Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1289549Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1289918Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1290135Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1290240Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1290336Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1290437Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1290441Z 2025-12-04T13:44:26.1290674Z [rank1]:[W1204 13:36:11.969767466 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1290848Z [rank3]:[W1204 13:36:11.355302653 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1291024Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1291283Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1291444Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1291816Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1292021Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1292126Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1292223Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1292319Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1292321Z 2025-12-04T13:44:26.1292570Z [rank3]:[W1204 13:36:11.356892288 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1292741Z [rank2]:[W1204 13:36:11.368385696 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1292921Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1293189Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1293363Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1293733Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1293944Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1294054Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1294150Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1294249Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1294252Z 2025-12-04T13:44:26.1294485Z [rank2]:[W1204 13:36:11.370192466 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1294657Z [rank1]:[W1204 13:36:12.969945001 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1294834Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1295090Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1295254Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1295619Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1295821Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1295926Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1296024Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1296121Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1296124Z 2025-12-04T13:44:26.1296357Z [rank1]:[W1204 13:36:12.971588875 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1296540Z [rank3]:[W1204 13:36:12.357059394 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1296715Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1296985Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1297165Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1297580Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1297784Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1297901Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1297998Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1298094Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1298096Z 2025-12-04T13:44:26.1298329Z [rank3]:[W1204 13:36:12.358448363 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1298500Z [rank2]:[W1204 13:36:12.370266924 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1298677Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1298934Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1299099Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1299467Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1299670Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1299778Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1299875Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1299974Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1299977Z 2025-12-04T13:44:26.1300208Z [rank2]:[W1204 13:36:12.371455738 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1300378Z [rank1]:[W1204 13:36:13.971718522 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1300568Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1300824Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1301003Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1301384Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1301588Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1301692Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1301799Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1301897Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1301901Z 2025-12-04T13:44:26.1302135Z [rank1]:[W1204 13:36:13.973022713 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1302306Z [rank3]:[W1204 13:36:13.358610979 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1302481Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1302739Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1302903Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1303274Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1303477Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1303580Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1303678Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1303774Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1303776Z 2025-12-04T13:44:26.1304009Z [rank3]:[W1204 13:36:13.360254333 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1304179Z [rank2]:[W1204 13:36:13.371540585 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1304355Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1304622Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1304789Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1305166Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1305378Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1305484Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1305579Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1305687Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1305689Z 2025-12-04T13:44:26.1305923Z [rank2]:[W1204 13:36:13.373244608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1306094Z [rank1]:[W1204 13:36:14.973163049 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1306270Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1306525Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1306690Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1307059Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1307263Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1307367Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1307464Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1307595Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1307598Z 2025-12-04T13:44:26.1307832Z [rank1]:[W1204 13:36:14.974572589 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1308006Z [rank3]:[W1204 13:36:14.360446768 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1308181Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1308453Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1308616Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1308997Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1309212Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1309316Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1309413Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1309509Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1309523Z 2025-12-04T13:44:26.1309759Z [rank3]:[W1204 13:36:14.362338287 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1309930Z [rank2]:[W1204 13:36:14.373361725 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1310106Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1310361Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1310526Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1310897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1311098Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1311206Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1311302Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1311402Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1311404Z 2025-12-04T13:44:26.1311637Z [rank2]:[W1204 13:36:14.374736655 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1311809Z [rank1]:[W1204 13:36:15.974752234 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1311986Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1312240Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1312413Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1312796Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1313008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1313112Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1313209Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1313307Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1313309Z 2025-12-04T13:44:26.1313543Z [rank1]:[W1204 13:36:15.976494266 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1313728Z [rank3]:[W1204 13:36:15.362479304 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1313904Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1314164Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1314327Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1314696Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1314900Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1315004Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1315100Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1315195Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1315197Z 2025-12-04T13:44:26.1315432Z [rank3]:[W1204 13:36:15.363720726 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1315604Z [rank2]:[W1204 13:36:15.374836852 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1315781Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1316038Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1316199Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1316575Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1316788Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1316904Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1316999Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1317100Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1317101Z 2025-12-04T13:44:26.1317335Z [rank2]:[W1204 13:36:15.376025506 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1317559Z [rank1]:[W1204 13:36:16.976696832 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1317737Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1317996Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1318159Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1318525Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1318730Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1318836Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1318931Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1319029Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1319031Z 2025-12-04T13:44:26.1319265Z [rank1]:[W1204 13:36:16.978511492 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1319437Z [rank3]:[W1204 13:36:16.363909592 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1319613Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1319870Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1320033Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1320420Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1320624Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1320741Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1320849Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1320945Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1320947Z 2025-12-04T13:44:26.1321181Z [rank3]:[W1204 13:36:16.367415465 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1321350Z [rank2]:[W1204 13:36:16.376138984 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1321539Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1321799Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1321963Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1322335Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1322538Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1322645Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1322741Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1322839Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1322840Z 2025-12-04T13:44:26.1323073Z [rank2]:[W1204 13:36:16.378105681 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1323247Z [rank1]:[W1204 13:36:17.978672468 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1323423Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1323680Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1323844Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1324220Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1324424Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1324533Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1324640Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1324748Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1324750Z 2025-12-04T13:44:26.1324982Z [rank1]:[W1204 13:36:17.980571787 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1325155Z [rank3]:[W1204 13:36:17.367576112 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1325330Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1325602Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1325765Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1326136Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1326340Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1326445Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1326545Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1326643Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1326645Z 2025-12-04T13:44:26.1326880Z [rank3]:[W1204 13:36:17.369628177 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1327051Z [rank2]:[W1204 13:36:17.378195739 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1327229Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1327523Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1327686Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1328054Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1328277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1328384Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1328479Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1328590Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1328604Z 2025-12-04T13:44:26.1328844Z [rank2]:[W1204 13:36:17.380591966 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1329015Z [rank1]:[W1204 13:36:18.980732623 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1329193Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1329451Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1329630Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1329997Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1330201Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1330307Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1330403Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1330504Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1330506Z 2025-12-04T13:44:26.1330740Z [rank1]:[W1204 13:36:18.982827017 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1330913Z [rank3]:[W1204 13:36:18.369810583 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1331090Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1331350Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1331512Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1331881Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1332086Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1332200Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1332300Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1332396Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1332398Z 2025-12-04T13:44:26.1332642Z [rank3]:[W1204 13:36:18.371952846 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1332823Z [rank2]:[W1204 13:36:18.380688454 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1332998Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1333259Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1333437Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1333806Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1334009Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1334118Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1334214Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1334313Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1334315Z 2025-12-04T13:44:26.1334551Z [rank2]:[W1204 13:36:18.382761249 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1334721Z [rank1]:[W1204 13:36:19.982996934 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1334898Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1335153Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1335319Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1335689Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1335893Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1335999Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1336103Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1336202Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1336205Z 2025-12-04T13:44:26.1336448Z [rank1]:[W1204 13:36:19.984949961 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1336629Z [rank3]:[W1204 13:36:19.372401617 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1336804Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1337063Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1337227Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1337648Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1337853Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1337957Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1338055Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1338151Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1338154Z 2025-12-04T13:44:26.1338389Z [rank3]:[W1204 13:36:19.374376343 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1338560Z [rank2]:[W1204 13:36:19.382876357 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1338737Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1338998Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1339161Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1339530Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1339733Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1339839Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1339935Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1340047Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1340050Z 2025-12-04T13:44:26.1340286Z [rank2]:[W1204 13:36:19.385272774 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1340469Z [rank1]:[W1204 13:36:20.985123498 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1340660Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1340916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1341079Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1341446Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1341670Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1341776Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1341873Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1341972Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1341974Z 2025-12-04T13:44:26.1342205Z [rank1]:[W1204 13:36:20.986574266 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1342378Z [rank3]:[W1204 13:36:20.374534670 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1342555Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1342814Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1342980Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1343347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1343553Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1343657Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1343754Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1343849Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1343851Z 2025-12-04T13:44:26.1344096Z [rank3]:[W1204 13:36:20.376704163 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1344267Z [rank2]:[W1204 13:36:20.385368703 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1344454Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1344721Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1344884Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1345254Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1345466Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1345573Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1345669Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1345767Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1345769Z 2025-12-04T13:44:26.1346003Z [rank2]:[W1204 13:36:20.387770750 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1346173Z [rank1]:[W1204 13:36:21.986762053 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1346350Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1346606Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1346769Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1347138Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1347343Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1347449Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1347582Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1347680Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1347681Z 2025-12-04T13:44:26.1347926Z [rank1]:[W1204 13:36:21.988466595 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1348099Z [rank3]:[W1204 13:36:21.377108085 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1348275Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1348544Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1348720Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1349086Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1349302Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1349407Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1349506Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1349602Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1349603Z 2025-12-04T13:44:26.1349838Z [rank3]:[W1204 13:36:21.378470265 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1350008Z [rank2]:[W1204 13:36:21.387897728 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1350185Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1350441Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1350605Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1350973Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1351174Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1351279Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1351376Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1351476Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1351477Z 2025-12-04T13:44:26.1351712Z [rank2]:[W1204 13:36:21.389437854 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1351892Z [rank1]:[W1204 13:36:22.988652812 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1356297Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1356575Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1356751Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1357120Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1357322Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1357442Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1357580Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1357679Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1357681Z 2025-12-04T13:44:26.1357915Z [rank1]:[W1204 13:36:22.990414423 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1358089Z [rank3]:[W1204 13:36:22.378641602 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1358264Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1358525Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1358691Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1359056Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1359259Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1359366Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1359465Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1359562Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1359565Z 2025-12-04T13:44:26.1359799Z [rank3]:[W1204 13:36:22.379880885 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1359968Z [rank2]:[W1204 13:36:22.389569012 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1360162Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1360423Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1360602Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1360993Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1361195Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1361316Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1361411Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1361509Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1361512Z 2025-12-04T13:44:26.1361748Z [rank2]:[W1204 13:36:22.391345573 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1361919Z [rank1]:[W1204 13:36:23.990598560 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1362095Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1362351Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1362520Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1362887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1363090Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1363196Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1363291Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1363390Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1363392Z 2025-12-04T13:44:26.1363626Z [rank1]:[W1204 13:36:23.991904202 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1363797Z [rank3]:[W1204 13:36:23.380072482 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1363982Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1364238Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1364413Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1364795Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1364996Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1365101Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1365198Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1365305Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1365307Z 2025-12-04T13:44:26.1365540Z [rank3]:[W1204 13:36:23.381880702 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1365712Z [rank2]:[W1204 13:36:23.391476621 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1365887Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1366144Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1366308Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1366676Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1366878Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1366984Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1367079Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1367177Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1367179Z 2025-12-04T13:44:26.1367413Z [rank2]:[W1204 13:36:23.393750221 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1367620Z [rank1]:[W1204 13:36:24.992079929 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1367796Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1368065Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1368230Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1368608Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1368823Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1368928Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1369024Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1369121Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1369137Z 2025-12-04T13:44:26.1369371Z [rank1]:[W1204 13:36:24.994438327 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1369543Z [rank3]:[W1204 13:36:24.382066889 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1369717Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1369974Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1370137Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1370504Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1370706Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1370809Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1370909Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1371004Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1371007Z 2025-12-04T13:44:26.1371245Z [rank3]:[W1204 13:36:24.383873409 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1371418Z [rank2]:[W1204 13:36:24.393865460 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1371592Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1371847Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1372020Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1372397Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1372608Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1372714Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1372809Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1372906Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1372908Z 2025-12-04T13:44:26.1373142Z [rank2]:[W1204 13:36:24.396391955 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1373325Z [rank1]:[W1204 13:36:25.994581115 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1373504Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1373762Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1373925Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1374290Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1374495Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1374600Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1374695Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1374791Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1374793Z 2025-12-04T13:44:26.1375025Z [rank1]:[W1204 13:36:25.995951845 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1375196Z [rank3]:[W1204 13:36:25.384062287 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1375372Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1375632Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1375810Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1376174Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1376395Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1376510Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1376606Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1376701Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1376703Z 2025-12-04T13:44:26.1376937Z [rank3]:[W1204 13:36:25.386386976 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1377118Z [rank2]:[W1204 13:36:25.396508353 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1377293Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1377577Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1377741Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1378109Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1378312Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1378418Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1378515Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1378610Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1378612Z 2025-12-04T13:44:26.1378846Z [rank2]:[W1204 13:36:25.398884281 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1379017Z [rank1]:[W1204 13:36:26.996082854 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1379192Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1379446Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1379611Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1379998Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1380213Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1380337Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1380433Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1380529Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1380531Z 2025-12-04T13:44:26.1380763Z [rank1]:[W1204 13:36:26.997315417 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1380934Z [rank3]:[W1204 13:36:26.386541174 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1381124Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1381380Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1381546Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1381915Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1382122Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1382227Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1382325Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1382421Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1382423Z 2025-12-04T13:44:26.1382657Z [rank3]:[W1204 13:36:26.390941267 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1382828Z [rank2]:[W1204 13:36:26.399035100 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1383003Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1383258Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1383421Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1383797Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1383999Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1384113Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1384219Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1384316Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1384318Z 2025-12-04T13:44:26.1384556Z [rank2]:[W1204 13:36:26.401398298 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1384726Z [rank1]:[W1204 13:36:27.997496274 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1384912Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1385167Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1385330Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1385695Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1385896Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1386003Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1386098Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1386196Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1386198Z 2025-12-04T13:44:26.1386431Z [rank1]:[W1204 13:36:27.999657937 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1386602Z [rank3]:[W1204 13:36:27.391052576 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1386777Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1387037Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1387200Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1387605Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1387821Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1387926Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1388022Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1388130Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1388145Z 2025-12-04T13:44:26.1388379Z [rank3]:[W1204 13:36:27.392224391 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1388549Z [rank2]:[W1204 13:36:27.401493797 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1388725Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1388994Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1389157Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1389523Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1389726Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1389832Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1389929Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1390026Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1390030Z 2025-12-04T13:44:26.1390264Z [rank2]:[W1204 13:36:27.403841256 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1390434Z [rank1]:[W1204 13:36:28.999772356 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1390610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1390866Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1391032Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1391400Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1391618Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1391724Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1391819Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1391915Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1391927Z 2025-12-04T13:44:26.1392171Z [rank1]:[W1204 13:36:28.001036608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1392343Z [rank3]:[W1204 13:36:28.392333310 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1392518Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1392775Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1392954Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1393320Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1393521Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1393626Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1393724Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1393820Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1393825Z 2025-12-04T13:44:26.1394058Z [rank3]:[W1204 13:36:28.393502034 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1394228Z [rank2]:[W1204 13:36:28.403948605 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1394402Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1394657Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1394819Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1395192Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1395396Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1395510Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1395607Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1395706Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1395708Z 2025-12-04T13:44:26.1395951Z [rank2]:[W1204 13:36:28.406208916 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1396130Z [rank1]:[W1204 13:36:29.001149978 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1396306Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1396560Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1396737Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1397103Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1397305Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1397410Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1397549Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1397645Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1397648Z 2025-12-04T13:44:26.1397883Z [rank1]:[W1204 13:36:29.002399620 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1398054Z [rank3]:[W1204 13:36:29.393692032 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1398231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1398487Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1398650Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1399016Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1399220Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1399323Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1399435Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1399532Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1399536Z 2025-12-04T13:44:26.1399769Z [rank3]:[W1204 13:36:29.395385635 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1399952Z [rank2]:[W1204 13:36:29.406329205 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1400139Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1400392Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1400555Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1400938Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1401140Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1401244Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1401340Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1401437Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1401439Z 2025-12-04T13:44:26.1401672Z [rank2]:[W1204 13:36:29.408230543 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1401843Z [rank1]:[W1204 13:36:30.002561809 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1402019Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1402271Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1402433Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1402798Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1402999Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1403103Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1403198Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1403295Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1403307Z 2025-12-04T13:44:26.1403541Z [rank1]:[W1204 13:36:30.004623244 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1403725Z [rank3]:[W1204 13:36:30.395554004 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1403914Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1404172Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1404336Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1404699Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1404918Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1405023Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1405120Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1405215Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1405219Z 2025-12-04T13:44:26.1405452Z [rank3]:[W1204 13:36:30.397179558 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1405623Z [rank2]:[W1204 13:36:30.408343323 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1405797Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1406054Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1406217Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1406583Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1406788Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1406893Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1406989Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1407084Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1407086Z 2025-12-04T13:44:26.1407329Z [rank2]:[W1204 13:36:30.410720641 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1407538Z [rank1]:[W1204 13:36:31.004781003 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1407727Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1407993Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1408158Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1408526Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1408740Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1408847Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1408943Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1409040Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1409042Z 2025-12-04T13:44:26.1409277Z [rank1]:[W1204 13:36:31.006412797 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1409447Z [rank3]:[W1204 13:36:31.397333267 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1409624Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1409881Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1410045Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1410413Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1410619Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1410724Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1410821Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1410917Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1410920Z 2025-12-04T13:44:26.1411151Z [rank3]:[W1204 13:36:31.398951691 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1411332Z [rank2]:[W1204 13:36:31.410828131 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1411508Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1411774Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1411947Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1412316Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1412519Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1412636Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1412736Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1412832Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1412834Z 2025-12-04T13:44:26.1413067Z [rank2]:[W1204 13:36:31.413186239 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1413236Z [rank1]:[W1204 13:36:32.006566186 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1413411Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1413668Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1413831Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1414199Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1414399Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1414506Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1414601Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1414699Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1414701Z 2025-12-04T13:44:26.1414936Z [rank1]:[W1204 13:36:32.009071461 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1415118Z [rank3]:[W1204 13:36:32.399106330 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1415294Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1415560Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1415732Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1416095Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1416298Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1416413Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1416510Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1416607Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1416610Z 2025-12-04T13:44:26.1416842Z [rank3]:[W1204 13:36:32.401109457 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1417014Z [rank2]:[W1204 13:36:32.413289409 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1417189Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1417445Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1417649Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1418017Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1418220Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1418325Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1418423Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1418520Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1418523Z 2025-12-04T13:44:26.1418758Z [rank2]:[W1204 13:36:32.415695906 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1418928Z [rank1]:[W1204 13:36:33.009210650 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1419119Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1419377Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1419555Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1419942Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1420143Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1420250Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1420358Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1420455Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1420457Z 2025-12-04T13:44:26.1420690Z [rank1]:[W1204 13:36:33.011673536 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1420863Z [rank3]:[W1204 13:36:33.401243016 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1421039Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1421295Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1421460Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1421829Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1422032Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1422138Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1422235Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1422333Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1422335Z 2025-12-04T13:44:26.1422567Z [rank3]:[W1204 13:36:33.402771642 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1422738Z [rank2]:[W1204 13:36:33.415825716 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1422913Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1423181Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1423344Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1423723Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1423937Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1424041Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1424137Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1424243Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1424245Z 2025-12-04T13:44:26.1424480Z [rank2]:[W1204 13:36:33.418223774 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1424651Z [rank1]:[W1204 13:36:34.011809036 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1424826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1425082Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1425245Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1425614Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1425816Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1425922Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1426018Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1426116Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1426119Z 2025-12-04T13:44:26.1426353Z [rank1]:[W1204 13:36:34.014253032 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1426524Z [rank3]:[W1204 13:36:34.402874143 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1426699Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1426966Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1427130Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1427540Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1427755Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1427860Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1427957Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1428054Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1428071Z 2025-12-04T13:44:26.1428305Z [rank3]:[W1204 13:36:34.404774911 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1428477Z [rank2]:[W1204 13:36:34.418321884 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1428652Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1428910Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1429072Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1429441Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1429644Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1429747Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1429844Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1429940Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1429942Z 2025-12-04T13:44:26.1430179Z [rank2]:[W1204 13:36:34.420457817 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1430352Z [rank1]:[W1204 13:36:35.014396372 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1430527Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1430785Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1430960Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1431338Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1431550Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1431655Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1431751Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1431847Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1431850Z 2025-12-04T13:44:26.1432084Z [rank1]:[W1204 13:36:35.016189503 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1432265Z [rank3]:[W1204 13:36:35.405399951 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1432443Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1432699Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1432862Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1433230Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1433434Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1433541Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1433636Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1433733Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1433735Z 2025-12-04T13:44:26.1433968Z [rank3]:[W1204 13:36:35.407435716 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1434142Z [rank2]:[W1204 13:36:35.420563158 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1434317Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1434576Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1434740Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1435115Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1435335Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1435450Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1435547Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1435643Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1435646Z 2025-12-04T13:44:26.1435880Z [rank2]:[W1204 13:36:35.422883797 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1436048Z [rank1]:[W1204 13:36:36.016355022 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1436235Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1436491Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1436654Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1437027Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1437228Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1437336Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1437431Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1437569Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1437570Z 2025-12-04T13:44:26.1437805Z [rank1]:[W1204 13:36:36.018118873 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1437974Z [rank3]:[W1204 13:36:36.407608675 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1438150Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1438404Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1438570Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1438955Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1439159Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1439277Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1439385Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1439481Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1439483Z 2025-12-04T13:44:26.1439715Z [rank3]:[W1204 13:36:36.409757918 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1439886Z [rank2]:[W1204 13:36:36.423007847 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1440073Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1440329Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1440494Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1440862Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1441063Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1441169Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1441266Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1441366Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1441368Z 2025-12-04T13:44:26.1441601Z [rank2]:[W1204 13:36:36.425398515 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1441771Z [rank1]:[W1204 13:36:37.018283143 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1441946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1442206Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1442368Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1442745Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1442946Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1443052Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1443157Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1443265Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1443266Z 2025-12-04T13:44:26.1443500Z [rank1]:[W1204 13:36:37.020175071 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1443670Z [rank3]:[W1204 13:36:37.409871249 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1443845Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1444112Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1444278Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1444642Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1444845Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1444951Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1445047Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1445144Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1445147Z 2025-12-04T13:44:26.1445380Z [rank3]:[W1204 13:36:37.411702669 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1445552Z [rank2]:[W1204 13:36:37.425491226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1445728Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1445984Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1446147Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1446515Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1446732Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1446836Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1446934Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1447039Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1447052Z 2025-12-04T13:44:26.1447285Z [rank2]:[W1204 13:36:37.428091999 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1447455Z [rank1]:[W1204 13:36:38.020358630 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1447670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1447927Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1448107Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1448475Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1448675Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1448779Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1448875Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1448971Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1448973Z 2025-12-04T13:44:26.1449205Z [rank1]:[W1204 13:36:38.022718659 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1449375Z [rank3]:[W1204 13:36:38.411821499 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1449550Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1449808Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1449973Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1450339Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1450540Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1450665Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1450760Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1450858Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1450860Z 2025-12-04T13:44:26.1451105Z [rank3]:[W1204 13:36:38.413749167 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1451289Z [rank2]:[W1204 13:36:38.428183590 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1451463Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1451719Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1451893Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1452262Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1452465Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1452571Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1452667Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1452763Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1452765Z 2025-12-04T13:44:26.1452999Z [rank2]:[W1204 13:36:38.430723604 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1453168Z [rank1]:[W1204 13:36:39.022858559 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1453343Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1453601Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1453764Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1454131Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1454334Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1454439Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1454544Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1454643Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1454646Z 2025-12-04T13:44:26.1454892Z [rank1]:[W1204 13:36:39.024905424 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1455071Z [rank3]:[W1204 13:36:39.413865118 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1455245Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1455501Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1455664Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1456040Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1456242Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1456348Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1456445Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1456543Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1456545Z 2025-12-04T13:44:26.1456778Z [rank3]:[W1204 13:36:39.415885974 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1456948Z [rank2]:[W1204 13:36:39.430817116 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1457123Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1457378Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1457583Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1457951Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1458155Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1458258Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1458354Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1458463Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1458465Z 2025-12-04T13:44:26.1458706Z [rank2]:[W1204 13:36:39.433149235 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1458897Z [rank1]:[W1204 13:36:40.025081634 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1459086Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1459341Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1459503Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1459869Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1460083Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1460189Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1460283Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1460380Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1460383Z 2025-12-04T13:44:26.1460618Z [rank1]:[W1204 13:36:40.027163538 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1460789Z [rank3]:[W1204 13:36:40.416024224 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1460966Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1461222Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1461385Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1461750Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1461953Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1462059Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1462155Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1462252Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1462253Z 2025-12-04T13:44:26.1462496Z [rank3]:[W1204 13:36:40.418048330 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1462667Z [rank2]:[W1204 13:36:40.433251986 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1462851Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1463119Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1463283Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1463652Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1463870Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1463975Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1464071Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1464167Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1464169Z 2025-12-04T13:44:26.1464402Z [rank2]:[W1204 13:36:40.435600985 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1464573Z [rank1]:[W1204 13:36:41.027320378 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1464748Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1465007Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1465168Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1465538Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1465739Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1465844Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1465940Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1466036Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1466039Z 2025-12-04T13:44:26.1466282Z [rank1]:[W1204 13:36:41.029232847 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1466452Z [rank3]:[W1204 13:36:41.418239199 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1466634Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1466898Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1467085Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1467459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1467717Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1467823Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1467918Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1468015Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1468016Z 2025-12-04T13:44:26.1468250Z [rank3]:[W1204 13:36:41.420256485 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1468422Z [rank2]:[W1204 13:36:41.435757645 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1468597Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1468862Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1469059Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1469479Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1469737Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1469900Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1470002Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1470102Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1470104Z 2025-12-04T13:44:26.1470338Z [rank2]:[W1204 13:36:41.438127663 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1470523Z [rank1]:[W1204 13:36:42.029366358 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1470699Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1470969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1471145Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1471531Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1471732Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1471850Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1471948Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1472045Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1472047Z 2025-12-04T13:44:26.1472281Z [rank1]:[W1204 13:36:42.031733366 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1472452Z [rank3]:[W1204 13:36:42.420408826 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1472626Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1472884Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1473047Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1473418Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1473622Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1473727Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1473831Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1473928Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1473931Z 2025-12-04T13:44:26.1474167Z [rank3]:[W1204 13:36:42.422476270 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1474339Z [rank2]:[W1204 13:36:42.438217305 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1474524Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1474781Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1474956Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1475335Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1475537Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1475643Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1475754Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1475852Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1475855Z 2025-12-04T13:44:26.1476089Z [rank2]:[W1204 13:36:42.440715320 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1476267Z [rank1]:[W1204 13:36:43.031867247 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1476445Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1476703Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1476868Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1477235Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1477441Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1477576Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1477675Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1477771Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1477775Z 2025-12-04T13:44:26.1478010Z [rank1]:[W1204 13:36:43.033557720 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1478181Z [rank3]:[W1204 13:36:43.422611982 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1478359Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1478637Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1478802Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1479182Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1479402Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1479506Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1479606Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1479715Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1479717Z 2025-12-04T13:44:26.1479949Z [rank3]:[W1204 13:36:43.424748355 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1480120Z [rank2]:[W1204 13:36:43.440806382 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1480297Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1480552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1480716Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1481091Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1481293Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1481412Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1481509Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1481608Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1481611Z 2025-12-04T13:44:26.1481848Z [rank2]:[W1204 13:36:43.443313217 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1482020Z [rank1]:[W1204 13:36:44.033733200 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1482196Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1482468Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1482635Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1483012Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1483225Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1483329Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1483427Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1483522Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1483550Z 2025-12-04T13:44:26.1483784Z [rank1]:[W1204 13:36:44.035660038 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1483955Z [rank3]:[W1204 13:36:44.424907725 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1484129Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1484387Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1484550Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1484952Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1485160Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1485264Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1485360Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1485457Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1485459Z 2025-12-04T13:44:26.1485696Z [rank3]:[W1204 13:36:44.427176186 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1485865Z [rank2]:[W1204 13:36:44.443421659 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1486042Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1486297Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1486473Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1486851Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1487076Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1487181Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1487276Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1487375Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1487377Z 2025-12-04T13:44:26.1487652Z [rank2]:[W1204 13:36:44.445747428 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1487840Z [rank1]:[W1204 13:36:45.035811429 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1488015Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1488267Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1488431Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1488797Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1489001Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1489121Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1489217Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1489314Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1489316Z 2025-12-04T13:44:26.1489549Z [rank1]:[W1204 13:36:45.037086191 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1489720Z [rank3]:[W1204 13:36:45.427327057 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1489895Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1490153Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1490314Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1490710Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1490926Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1491043Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1491140Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1491235Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1491237Z 2025-12-04T13:44:26.1491471Z [rank3]:[W1204 13:36:45.428587179 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1491655Z [rank2]:[W1204 13:36:45.445843060 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1491832Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1492090Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1492253Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1492621Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1492825Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1492930Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1493026Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1493123Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1493125Z 2025-12-04T13:44:26.1493359Z [rank2]:[W1204 13:36:45.446972046 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1493529Z [rank1]:[W1204 13:36:46.037300400 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1493705Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1493961Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1494126Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1494511Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1494715Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1494838Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1494944Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1495041Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1495043Z 2025-12-04T13:44:26.1495276Z [rank1]:[W1204 13:36:46.039740137 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1495447Z [rank3]:[W1204 13:36:46.428764370 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1495637Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1495895Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1496057Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1496427Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1496631Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1496735Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1496832Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1496927Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1496930Z 2025-12-04T13:44:26.1497163Z [rank3]:[W1204 13:36:46.430321876 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1497333Z [rank2]:[W1204 13:36:46.447053889 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1497546Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1497803Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1497965Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1498346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1498549Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1498656Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1498763Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1498876Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1498878Z 2025-12-04T13:44:26.1499111Z [rank2]:[W1204 13:36:46.449363428 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1499281Z [rank1]:[W1204 13:36:47.039903308 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1499456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1499727Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1499893Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1500258Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1500459Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1500565Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1500675Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1500776Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1500778Z 2025-12-04T13:44:26.1501010Z [rank1]:[W1204 13:36:47.042423393 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1501181Z [rank3]:[W1204 13:36:47.430505457 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1501360Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1501619Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1501783Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1502158Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1502370Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1502476Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1502571Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1502677Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1502689Z 2025-12-04T13:44:26.1502925Z [rank3]:[W1204 13:36:47.432502783 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1503096Z [rank2]:[W1204 13:36:47.449494790 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1503272Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1503533Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1503709Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1504084Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1504291Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1504397Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1504496Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1504593Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1504595Z 2025-12-04T13:44:26.1504833Z [rank2]:[W1204 13:36:47.451778340 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1505007Z [rank1]:[W1204 13:36:48.042596114 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1505188Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1505450Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1505622Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1505988Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1506190Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1506306Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1506402Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1506499Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1506501Z 2025-12-04T13:44:26.1506743Z [rank1]:[W1204 13:36:48.044975412 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1506922Z [rank3]:[W1204 13:36:48.432625265 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1507097Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1507356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1507572Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1507938Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1508141Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1508246Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1508343Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1508439Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1508441Z 2025-12-04T13:44:26.1508675Z [rank3]:[W1204 13:36:48.435096781 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1508845Z [rank2]:[W1204 13:36:48.451867972 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1509021Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1509277Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1509441Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1509811Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1510014Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1510120Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1510240Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1510337Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1510340Z 2025-12-04T13:44:26.1510587Z [rank2]:[W1204 13:36:48.454260660 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1510772Z [rank1]:[W1204 13:36:49.045105014 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1510964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1511222Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1511386Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1511767Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1511971Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1512077Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1512173Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1512270Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1512273Z 2025-12-04T13:44:26.1512504Z [rank1]:[W1204 13:36:49.047543440 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1512675Z [rank3]:[W1204 13:36:49.435292871 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1512849Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1513105Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1513269Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1513637Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1513839Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1513946Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1514043Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1514150Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1514152Z 2025-12-04T13:44:26.1514387Z [rank3]:[W1204 13:36:49.437678319 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1514568Z [rank2]:[W1204 13:36:49.454388162 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1514754Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1515011Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1515174Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1515542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1515755Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1515859Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1515953Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1516051Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1516053Z 2025-12-04T13:44:26.1516291Z [rank2]:[W1204 13:36:49.456604624 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1516463Z [rank1]:[W1204 13:36:50.047714292 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1516639Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1516893Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1517057Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1517426Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1517660Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1517766Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1517861Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1517958Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1517960Z 2025-12-04T13:44:26.1518204Z [rank1]:[W1204 13:36:50.050248256 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1518376Z [rank3]:[W1204 13:36:50.437858820 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1518563Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1518833Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1518997Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1519362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1519578Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1519683Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1519780Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1519875Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1519877Z 2025-12-04T13:44:26.1520135Z [rank3]:[W1204 13:36:50.439932175 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1520309Z [rank2]:[W1204 13:36:50.456723686 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1520484Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1520742Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1520906Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1521275Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1521477Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1521582Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1521677Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1521774Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1521775Z 2025-12-04T13:44:26.1522020Z [rank2]:[W1204 13:36:50.458825380 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1522190Z [rank1]:[W1204 13:36:51.050418118 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1522365Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1522628Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1522809Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1523177Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1523400Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1523522Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1523619Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1523715Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1523717Z 2025-12-04T13:44:26.1523950Z [rank1]:[W1204 13:36:51.052284837 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1524121Z [rank3]:[W1204 13:36:51.440056647 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1524297Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1524554Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1524719Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1525088Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1525293Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1525398Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1525494Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1525589Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1525591Z 2025-12-04T13:44:26.1525824Z [rank3]:[W1204 13:36:51.441308190 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1526004Z [rank2]:[W1204 13:36:51.458930323 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1526179Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1526445Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1526616Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1526983Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1527190Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1531130Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1531229Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1531325Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1531328Z 2025-12-04T13:44:26.1531583Z [rank2]:[W1204 13:36:51.460483289 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1531752Z [rank1]:[W1204 13:36:52.052440889 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1531927Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1532181Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1532346Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1532713Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1532917Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1533026Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1533122Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1533221Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1533223Z 2025-12-04T13:44:26.1533457Z [rank1]:[W1204 13:36:52.054889295 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1533630Z [rank3]:[W1204 13:36:52.441470772 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1533823Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1534080Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1534257Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1534636Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1534839Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1534959Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1535056Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1535152Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1535155Z 2025-12-04T13:44:26.1535392Z [rank3]:[W1204 13:36:52.443050457 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1535563Z [rank2]:[W1204 13:36:52.460609672 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1535738Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1535992Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1536156Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1536525Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1536727Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1536831Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1536928Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1537025Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1537028Z 2025-12-04T13:44:26.1537262Z [rank2]:[W1204 13:36:52.462958260 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1537432Z [rank1]:[W1204 13:36:53.055084136 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1537660Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1537916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1538092Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1538472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1538674Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1538810Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1538920Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1539029Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1539031Z 2025-12-04T13:44:26.1539264Z [rank1]:[W1204 13:36:53.057548782 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1539435Z [rank3]:[W1204 13:36:53.443224749 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1539609Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1539870Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1540037Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1540404Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1540608Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1540713Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1540809Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1540905Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1540907Z 2025-12-04T13:44:26.1541141Z [rank3]:[W1204 13:36:53.445175016 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1541313Z [rank2]:[W1204 13:36:53.463080963 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1541489Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1541757Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1541921Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1542306Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1542528Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1542634Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1542731Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1542827Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1542841Z 2025-12-04T13:44:26.1543076Z [rank2]:[W1204 13:36:53.465268565 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1543246Z [rank1]:[W1204 13:36:54.057696815 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1543426Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1543680Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1543844Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1544217Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1544420Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1544525Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1544621Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1544717Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1544720Z 2025-12-04T13:44:26.1544951Z [rank1]:[W1204 13:36:54.060129941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1545123Z [rank3]:[W1204 13:36:54.445339698 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1545297Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1545562Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1545727Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1546105Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1546318Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1546422Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1546520Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1546616Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1546619Z 2025-12-04T13:44:26.1546851Z [rank3]:[W1204 13:36:54.448436870 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1547032Z [rank2]:[W1204 13:36:54.465377158 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1547207Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1547461Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1547661Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1548031Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1548237Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1548342Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1548439Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1548537Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1548539Z 2025-12-04T13:44:26.1548777Z [rank2]:[W1204 13:36:54.467726377 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1548949Z [rank1]:[W1204 13:36:55.060318333 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1549126Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1549379Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1549559Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1549927Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1550140Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1550259Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1550353Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1550450Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1550452Z 2025-12-04T13:44:26.1550685Z [rank1]:[W1204 13:36:55.062548684 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1550872Z [rank3]:[W1204 13:36:55.448603612 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1551047Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1551303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1551467Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1551832Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1552035Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1552140Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1552240Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1552335Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1552339Z 2025-12-04T13:44:26.1552571Z [rank3]:[W1204 13:36:55.450939201 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1552745Z [rank2]:[W1204 13:36:55.467875689 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1552922Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1553180Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1553342Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1553720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1553933Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1554047Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1554144Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1554240Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1554242Z 2025-12-04T13:44:26.1554476Z [rank2]:[W1204 13:36:55.469359667 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1554645Z [rank1]:[W1204 13:36:56.062738816 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1554830Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1555085Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1555250Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1555619Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1555822Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1555928Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1556025Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1556123Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1556124Z 2025-12-04T13:44:26.1556358Z [rank1]:[W1204 13:36:56.065233641 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1556529Z [rank3]:[W1204 13:36:56.451118013 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1556705Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1556960Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1557125Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1557533Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1557736Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1557862Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1557971Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1558072Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1558076Z 2025-12-04T13:44:26.1558311Z [rank3]:[W1204 13:36:56.453077720 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1558482Z [rank2]:[W1204 13:36:56.469469990 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1558670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1558926Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1559090Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1559459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1559665Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1559772Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1559869Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1559966Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1559968Z 2025-12-04T13:44:26.1560201Z [rank2]:[W1204 13:36:56.470852790 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1560371Z [rank1]:[W1204 13:36:57.065385134 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1560546Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1560804Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1560970Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1561336Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1561548Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1561655Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1561751Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1561860Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1561872Z 2025-12-04T13:44:26.1562107Z [rank1]:[W1204 13:36:57.067847790 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1562279Z [rank3]:[W1204 13:36:57.453248483 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1562456Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1562722Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1562886Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1563253Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1563458Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1563562Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1563663Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1563759Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1563762Z 2025-12-04T13:44:26.1563997Z [rank3]:[W1204 13:36:57.454499045 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1564167Z [rank2]:[W1204 13:36:57.470957424 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1564342Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1564597Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1564762Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1565130Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1565344Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1565449Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1565546Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1565642Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1565653Z 2025-12-04T13:44:26.1565895Z [rank2]:[W1204 13:36:57.472314254 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1566064Z [rank1]:[W1204 13:36:58.068043462 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1566243Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1566502Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1566679Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1567047Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1567248Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1567355Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1567450Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1567596Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1567598Z 2025-12-04T13:44:26.1567833Z [rank1]:[W1204 13:36:58.070523377 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1568004Z [rank3]:[W1204 13:36:58.454672088 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1568180Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1568440Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1568606Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1568974Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1569180Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1569298Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1569395Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1569493Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1569495Z 2025-12-04T13:44:26.1569741Z [rank3]:[W1204 13:36:58.456263423 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1569924Z [rank2]:[W1204 13:36:58.472449218 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1570098Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1570357Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1570533Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1570907Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1571110Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1571214Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1571312Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1571409Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1571412Z 2025-12-04T13:44:26.1571646Z [rank2]:[W1204 13:36:58.474006793 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1571816Z [rank1]:[W1204 13:36:59.070691090 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1571990Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1572244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1572409Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1572779Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1572983Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1573089Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1573202Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1573299Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1573301Z 2025-12-04T13:44:26.1573534Z [rank1]:[W1204 13:36:59.073188075 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1573715Z [rank3]:[W1204 13:36:59.456453455 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1573903Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1574159Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1574325Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1574701Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1574902Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1575008Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1575104Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1575201Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1575203Z 2025-12-04T13:44:26.1575439Z [rank3]:[W1204 13:36:59.458118459 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1575610Z [rank2]:[W1204 13:36:59.474197196 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1575785Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1576040Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1576204Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1576573Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1576778Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1576882Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1576978Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1577074Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1577087Z 2025-12-04T13:44:26.1577324Z [rank2]:[W1204 13:36:59.476094414 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1577546Z [rank1]:[W1204 13:37:00.073543914 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1577735Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1577991Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1578155Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1578522Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1578735Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1578846Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1578940Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1579038Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1579040Z 2025-12-04T13:44:26.1579273Z [rank1]:[W1204 13:37:00.076236025 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1579445Z [rank3]:[W1204 13:37:00.458297471 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1579621Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1579877Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1580040Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1580406Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1580610Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1580715Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1580810Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1580906Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1580908Z 2025-12-04T13:44:26.1581151Z [rank3]:[W1204 13:37:00.460482613 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1581326Z [rank2]:[W1204 13:37:00.476201018 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1581509Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1581777Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1581939Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1582307Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1582521Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1582625Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1582723Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1582819Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1582820Z 2025-12-04T13:44:26.1583055Z [rank2]:[W1204 13:37:00.477328534 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1583224Z [rank1]:[W1204 13:37:01.076413348 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1583400Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1583657Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1583821Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1584187Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1584391Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1584498Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1584593Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1584691Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1584693Z 2025-12-04T13:44:26.1584927Z [rank1]:[W1204 13:37:01.078014563 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1585105Z [rank3]:[W1204 13:37:01.460611237 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1585283Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1585547Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1585720Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1586089Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1586293Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1586416Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1586512Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1586609Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1586611Z 2025-12-04T13:44:26.1586843Z [rank3]:[W1204 13:37:01.462703941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1587015Z [rank2]:[W1204 13:37:01.477637564 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1587188Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1587445Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1587661Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1588027Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1588231Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1588336Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1588433Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1588529Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1588531Z 2025-12-04T13:44:26.1588764Z [rank2]:[W1204 13:37:01.478988284 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1588945Z [rank1]:[W1204 13:37:02.078128037 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1589123Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1589393Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1589566Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1589931Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1590133Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1590255Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1590350Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1590448Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1590451Z 2025-12-04T13:44:26.1590685Z [rank1]:[W1204 13:37:02.080611142 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1590853Z [rank3]:[W1204 13:37:02.462896674 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1591031Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1591289Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1591453Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1591819Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1592020Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1592125Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1592220Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1592318Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1592321Z 2025-12-04T13:44:26.1592556Z [rank3]:[W1204 13:37:02.465404769 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1592728Z [rank2]:[W1204 13:37:02.479108038 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1592911Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1593168Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1593341Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1593720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1593926Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1594031Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1594137Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1594234Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1594236Z 2025-12-04T13:44:26.1594469Z [rank2]:[W1204 13:37:02.481041076 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1594639Z [rank1]:[W1204 13:37:03.080771436 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1594816Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1595073Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1595236Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1595602Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1595804Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1595911Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1596006Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1596106Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1596108Z 2025-12-04T13:44:26.1596344Z [rank1]:[W1204 13:37:03.083252722 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1596514Z [rank3]:[W1204 13:37:03.465549053 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1596689Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1596956Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1597123Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1597542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1597757Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1597864Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1597959Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1598068Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1598069Z 2025-12-04T13:44:26.1598302Z [rank3]:[W1204 13:37:03.468035188 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1598473Z [rank2]:[W1204 13:37:03.481154830 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1598646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1598902Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1599067Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1599436Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1599641Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1599744Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1599842Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1599937Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1599940Z 2025-12-04T13:44:26.1600175Z [rank2]:[W1204 13:37:03.483042939 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1600345Z [rank1]:[W1204 13:37:04.083373016 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1600521Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1600793Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1600955Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1601338Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1601555Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1601659Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1601755Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1601852Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1601864Z 2025-12-04T13:44:26.1602102Z [rank1]:[W1204 13:37:04.085874021 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1602273Z [rank3]:[W1204 13:37:04.468164403 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1602449Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1602705Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1602869Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1603235Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1603437Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1603542Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1603637Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1603734Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1603736Z 2025-12-04T13:44:26.1603968Z [rank3]:[W1204 13:37:04.470129320 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1604140Z [rank2]:[W1204 13:37:04.483125864 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1604314Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1604568Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1604742Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1605119Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1605332Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1611253Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1611361Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1611460Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1611464Z 2025-12-04T13:44:26.1611705Z [rank2]:[W1204 13:37:04.484387687 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1611914Z [rank1]:[W1204 13:37:05.086065364 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1612094Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1612356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1612521Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1612890Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1613095Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1613203Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1613299Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1613395Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1613398Z 2025-12-04T13:44:26.1613636Z [rank1]:[W1204 13:37:05.088669037 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1613808Z [rank3]:[W1204 13:37:05.470334912 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1613985Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1614243Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1614408Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1614791Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1615009Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1615125Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1615221Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1615317Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1615319Z 2025-12-04T13:44:26.1615553Z [rank3]:[W1204 13:37:05.472371638 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1615733Z [rank2]:[W1204 13:37:05.484522701 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1615909Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1616169Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1616331Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1616702Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1616907Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1617012Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1617108Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1617205Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1617207Z 2025-12-04T13:44:26.1617441Z [rank2]:[W1204 13:37:05.486198494 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1617660Z [rank1]:[W1204 13:37:06.088808101 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1617839Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1618096Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1618259Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1618641Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1618843Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1618964Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1619071Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1619167Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1619169Z 2025-12-04T13:44:26.1619403Z [rank1]:[W1204 13:37:06.090812548 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1619572Z [rank3]:[W1204 13:37:06.472576091 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1619762Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1620018Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1620182Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1620554Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1620756Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1620862Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1620958Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1621055Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1621057Z 2025-12-04T13:44:26.1621288Z [rank3]:[W1204 13:37:06.474876000 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1621459Z [rank2]:[W1204 13:37:06.486311679 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1621633Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1621890Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1622055Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1622429Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1622632Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1622738Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1622844Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1622954Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1622956Z 2025-12-04T13:44:26.1623193Z [rank2]:[W1204 13:37:06.487778467 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1623363Z [rank1]:[W1204 13:37:07.090945802 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1623537Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1623804Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1623965Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1624330Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1624531Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1624637Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1624731Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1624831Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1624833Z 2025-12-04T13:44:26.1625068Z [rank1]:[W1204 13:37:07.093435697 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1625237Z [rank3]:[W1204 13:37:07.475026945 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1625412Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1625669Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1625833Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1626202Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1626415Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1626520Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1626616Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1626730Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1626742Z 2025-12-04T13:44:26.1626977Z [rank3]:[W1204 13:37:07.477508940 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1627149Z [rank2]:[W1204 13:37:07.487861963 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1627324Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1627632Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1627812Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1628180Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1628384Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1628487Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1628584Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1628680Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1628682Z 2025-12-04T13:44:26.1628917Z [rank2]:[W1204 13:37:07.489114325 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1629088Z [rank1]:[W1204 13:37:08.093560253 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1629264Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1629521Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1629683Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1630049Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1630250Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1630368Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1630462Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1630560Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1630562Z 2025-12-04T13:44:26.1630809Z [rank1]:[W1204 13:37:08.096072227 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1630989Z [rank3]:[W1204 13:37:08.477636035 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1631165Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1631423Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1631598Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1631963Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1632166Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1632271Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1632366Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1632463Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1632465Z 2025-12-04T13:44:26.1632698Z [rank3]:[W1204 13:37:08.479644131 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1632868Z [rank2]:[W1204 13:37:08.489196441 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1633041Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1633298Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1633462Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1633836Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1634039Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1634143Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1634250Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1634348Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1634350Z 2025-12-04T13:44:26.1634593Z [rank2]:[W1204 13:37:08.490377885 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1634774Z [rank1]:[W1204 13:37:09.096208012 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1634948Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1635204Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1635364Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1635748Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1635949Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1636053Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1636149Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1636244Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1636246Z 2025-12-04T13:44:26.1636481Z [rank1]:[W1204 13:37:09.098968062 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1636650Z [rank3]:[W1204 13:37:09.479832145 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1636826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1637082Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1637244Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1637644Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1637847Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1637952Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1638047Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1638157Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1638159Z 2025-12-04T13:44:26.1638392Z [rank3]:[W1204 13:37:09.482080195 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1638577Z [rank2]:[W1204 13:37:09.490486151 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1638765Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1639022Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1639186Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1639555Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1639771Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1639874Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1639972Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1640068Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1640071Z 2025-12-04T13:44:26.1640306Z [rank2]:[W1204 13:37:09.491704784 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1640478Z [rank1]:[W1204 13:37:10.099117406 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1640654Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1640909Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1641070Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1641438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1641641Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1641746Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1641845Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1641940Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1641942Z 2025-12-04T13:44:26.1642195Z [rank1]:[W1204 13:37:10.101563823 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1642367Z [rank3]:[W1204 13:37:10.482219210 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1642553Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1642818Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1642981Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1643348Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1643559Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1643665Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1643758Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1643855Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1643856Z 2025-12-04T13:44:26.1644089Z [rank3]:[W1204 13:37:10.484390053 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1644259Z [rank2]:[W1204 13:37:10.491783891 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1644435Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1644692Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1644855Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1645221Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1645423Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1645529Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1645625Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1645721Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1645722Z 2025-12-04T13:44:26.1645968Z [rank2]:[W1204 13:37:10.493290628 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1646139Z [rank1]:[W1204 13:37:11.101708338 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1646313Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1646580Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1646753Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1647119Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1647329Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1647434Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1647570Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1647666Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1647668Z 2025-12-04T13:44:26.1647901Z [rank1]:[W1204 13:37:11.104085406 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1648073Z [rank3]:[W1204 13:37:11.484525448 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1648250Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1648506Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1648671Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1649044Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1649244Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1649349Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1649444Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1649541Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1649543Z 2025-12-04T13:44:26.1649775Z [rank3]:[W1204 13:37:11.486544384 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1649957Z [rank2]:[W1204 13:37:11.493376404 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1650131Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1650402Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1650578Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1650947Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1651151Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1651269Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1651365Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1651462Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1651463Z 2025-12-04T13:44:26.1651697Z [rank2]:[W1204 13:37:11.494560588 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1651867Z [rank1]:[W1204 13:37:12.104218611 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1652041Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1652298Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1652461Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1652827Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1653028Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1653136Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1653233Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1653329Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1653332Z 2025-12-04T13:44:26.1653570Z [rank1]:[W1204 13:37:12.106693797 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1653740Z [rank3]:[W1204 13:37:12.486867715 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1653924Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1654179Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1654352Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1654728Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1654928Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1655033Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1655143Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1655240Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1655243Z 2025-12-04T13:44:26.1655476Z [rank3]:[W1204 13:37:12.489168934 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1655645Z [rank2]:[W1204 13:37:12.494645624 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1655819Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1656073Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1656237Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1656605Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1656807Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1656911Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1657008Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1657103Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1657107Z 2025-12-04T13:44:26.1657340Z [rank2]:[W1204 13:37:12.495870267 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1657559Z [rank1]:[W1204 13:37:13.106836882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1657736Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1658004Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1658167Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1658543Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1658756Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1658860Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1658956Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1659064Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1659066Z 2025-12-04T13:44:26.1659300Z [rank1]:[W1204 13:37:13.109299238 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1659470Z [rank3]:[W1204 13:37:13.489348909 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1659645Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1659904Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1660068Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1660436Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1660638Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1660743Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1660839Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1660935Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1660937Z 2025-12-04T13:44:26.1661170Z [rank3]:[W1204 13:37:13.491511192 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1661342Z [rank2]:[W1204 13:37:13.495977594 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1661519Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1661784Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1661950Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1662331Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1662541Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1662645Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1662741Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1662837Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1662850Z 2025-12-04T13:44:26.1663084Z [rank2]:[W1204 13:37:13.497208357 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1663256Z [rank1]:[W1204 13:37:14.109617959 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1663429Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1663686Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1663848Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1664220Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1664422Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1664526Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1664621Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1664717Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1664719Z 2025-12-04T13:44:26.1664953Z [rank1]:[W1204 13:37:14.112210652 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1665122Z [rank3]:[W1204 13:37:14.491635967 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1665298Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1665553Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1665728Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1666107Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1666320Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1666425Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1666520Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1666617Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1666619Z 2025-12-04T13:44:26.1666851Z [rank3]:[W1204 13:37:14.493818859 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1667032Z [rank2]:[W1204 13:37:14.497294513 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1667207Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1667460Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1667659Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1668027Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1668234Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1668339Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1668435Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1668531Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1668534Z 2025-12-04T13:44:26.1668767Z [rank2]:[W1204 13:37:14.498739722 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1668939Z [rank1]:[W1204 13:37:15.112348458 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1669113Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1669368Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1669531Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1669911Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1670126Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1670249Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1670345Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1670441Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1670443Z 2025-12-04T13:44:26.1670677Z [rank1]:[W1204 13:37:15.114837164 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1670861Z [rank3]:[W1204 13:37:15.493966345 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1671036Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1671294Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1671456Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1671824Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1672025Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1672130Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1672225Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1672321Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1672322Z 2025-12-04T13:44:26.1672555Z [rank3]:[W1204 13:37:15.495732986 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1672728Z [rank2]:[W1204 13:37:15.498824008 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1672906Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1673161Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1673324Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1673699Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1673902Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1674015Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1674120Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1674217Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1674220Z 2025-12-04T13:44:26.1674453Z [rank2]:[W1204 13:37:15.500174089 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1674624Z [rank1]:[W1204 13:37:16.115027898 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1674808Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1675069Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1675232Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1675606Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1675808Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1675913Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1676011Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1676107Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1676109Z 2025-12-04T13:44:26.1676342Z [rank1]:[W1204 13:37:16.117534573 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1676512Z [rank3]:[W1204 13:37:16.496057488 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1676688Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1676943Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1677108Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1677532Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1677733Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1677840Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1677947Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1678057Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1678059Z 2025-12-04T13:44:26.1678290Z [rank3]:[W1204 13:37:16.498432346 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1678464Z [rank2]:[W1204 13:37:16.500247676 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1678642Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1678912Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1679077Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1679448Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1679653Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1679758Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1679856Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1679954Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1679956Z 2025-12-04T13:44:26.1680191Z [rank2]:[W1204 13:37:16.501389761 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1680362Z [rank1]:[W1204 13:37:17.117671349 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1680535Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1680792Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1680955Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1681322Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1681539Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1681644Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1681740Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1681847Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1681858Z 2025-12-04T13:44:26.1682092Z [rank1]:[W1204 13:37:17.120195484 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1682263Z [rank3]:[W1204 13:37:17.498608931 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1682440Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1682696Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1682875Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1683245Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1683447Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1683553Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1683649Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1683746Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1683749Z 2025-12-04T13:44:26.1683981Z [rank3]:[W1204 13:37:17.500262375 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1684151Z [rank2]:[W1204 13:37:17.501474768 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1684327Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1684580Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1684745Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1685111Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1685323Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1685427Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1685525Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1685621Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1685623Z 2025-12-04T13:44:26.1685872Z [rank2]:[W1204 13:37:17.502607133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1686052Z [rank1]:[W1204 13:37:18.120304910 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1686226Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1686481Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1686654Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1687022Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1687225Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1687330Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1687426Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1687559Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1687561Z 2025-12-04T13:44:26.1687795Z [rank1]:[W1204 13:37:18.122945943 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1687966Z [rank3]:[W1204 13:37:18.500456690 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1688144Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1688401Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1688564Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1688934Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1689135Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1689239Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1689347Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1689445Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1689448Z 2025-12-04T13:44:26.1689693Z [rank3]:[W1204 13:37:18.502434056 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1689878Z [rank2]:[W1204 13:37:18.502709180 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1690052Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1690308Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1690472Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1690852Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1691055Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1691161Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1691257Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1691354Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1691357Z 2025-12-04T13:44:26.1691589Z [rank2]:[W1204 13:37:18.503901974 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1691760Z [rank1]:[W1204 13:37:19.123121128 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1691934Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1692188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1692350Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1692720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1692921Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1693024Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1693120Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1693225Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1693227Z 2025-12-04T13:44:26.1693462Z [rank1]:[W1204 13:37:19.124976677 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1693643Z [rank3]:[W1204 13:37:19.502563323 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1693827Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1694083Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1694245Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1694616Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1694828Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1694932Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1695029Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1695127Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1695129Z 2025-12-04T13:44:26.1695364Z [rank3]:[W1204 13:37:19.504534990 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1695535Z [rank2]:[W1204 13:37:19.504033490 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1695715Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1695969Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1696133Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1696499Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1696704Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1696811Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1696907Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1697004Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1697006Z 2025-12-04T13:44:26.1697247Z [rank2]:[W1204 13:37:19.506305601 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1697428Z [rank1]:[W1204 13:37:20.125145293 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1697646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1697921Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1698081Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1698456Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1698673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1698779Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1698882Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1698978Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1698979Z 2025-12-04T13:44:26.1699221Z [rank1]:[W1204 13:37:20.126938824 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1699395Z [rank3]:[W1204 13:37:20.504688066 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1699573Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1699836Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1700003Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1700374Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1700580Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1700694Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1700791Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1700892Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1700894Z 2025-12-04T13:44:26.1701147Z [rank3]:[W1204 13:37:20.506525665 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1701318Z [rank2]:[W1204 13:37:20.506400018 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1701505Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1701771Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1701934Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1702301Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1702512Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1702617Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1702714Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1702810Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1702812Z 2025-12-04T13:44:26.1703044Z [rank2]:[W1204 13:37:20.508148600 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1703214Z [rank1]:[W1204 13:37:21.127130629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1703391Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1703646Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1703808Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1704174Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1704376Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1704480Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1704577Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1704672Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1704674Z 2025-12-04T13:44:26.1704908Z [rank1]:[W1204 13:37:21.128802342 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1705086Z [rank3]:[W1204 13:37:21.506644742 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1705263Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1705531Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1705704Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1706073Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1706276Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1706391Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1706486Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1706584Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1706586Z 2025-12-04T13:44:26.1706818Z [rank3]:[W1204 13:37:21.508740086 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1706988Z [rank2]:[W1204 13:37:21.508239767 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1707163Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1707420Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1707622Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1707991Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1708195Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1708300Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1708395Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1708492Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1708494Z 2025-12-04T13:44:26.1708726Z [rank2]:[W1204 13:37:21.510032738 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1708896Z [rank1]:[W1204 13:37:22.129008497 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1709083Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1709342Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1709516Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1709897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1710103Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1710219Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1710314Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1710410Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1710413Z 2025-12-04T13:44:26.1710646Z [rank1]:[W1204 13:37:22.131233069 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1710816Z [rank3]:[W1204 13:37:22.508854974 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1710993Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1711249Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1711412Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1711779Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1711981Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1712087Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1712183Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1712280Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1712282Z 2025-12-04T13:44:26.1712519Z [rank3]:[W1204 13:37:22.510409889 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1712688Z [rank2]:[W1204 13:37:22.510108816 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1712872Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1713128Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1713302Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1713678Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1713880Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1713986Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1714080Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1714194Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1714195Z 2025-12-04T13:44:26.1714430Z [rank2]:[W1204 13:37:22.511869987 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1714600Z [rank1]:[W1204 13:37:23.131412394 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1714774Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1715032Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1715195Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1715560Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1715763Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1715867Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1715965Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1716061Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1716063Z 2025-12-04T13:44:26.1716299Z [rank1]:[W1204 13:37:23.133826492 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1716471Z [rank3]:[W1204 13:37:23.510508517 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1716647Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1716912Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1717075Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1717450Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1717686Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1717791Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1717886Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1717982Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1717996Z 2025-12-04T13:44:26.1718231Z [rank3]:[W1204 13:37:23.512457064 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1718402Z [rank2]:[W1204 13:37:23.511944866 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1718577Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1718838Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1719002Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1719371Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1719574Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1719679Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1719776Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1719873Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1719877Z 2025-12-04T13:44:26.1720111Z [rank2]:[W1204 13:37:23.513667908 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1720283Z [rank1]:[W1204 13:37:24.133988328 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1720457Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1720726Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1720890Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1721279Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1721491Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1721595Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1721691Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1721787Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1721789Z 2025-12-04T13:44:26.1722022Z [rank1]:[W1204 13:37:24.136424894 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1722203Z [rank3]:[W1204 13:37:24.512587581 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1722378Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1722635Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1722798Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1723175Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1723377Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1723483Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1723577Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1723675Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1723677Z 2025-12-04T13:44:26.1723910Z [rank3]:[W1204 13:37:24.514592408 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1724082Z [rank2]:[W1204 13:37:24.513901313 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1724259Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1724513Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1724690Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1725056Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1725274Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1725389Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1725484Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1725581Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1725583Z 2025-12-04T13:44:26.1725815Z [rank2]:[W1204 13:37:24.515945488 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1725998Z [rank1]:[W1204 13:37:25.136626880 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1726172Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1726429Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1726592Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1726958Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1727163Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1727268Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1727364Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1727459Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1727461Z 2025-12-04T13:44:26.1727778Z [rank1]:[W1204 13:37:25.138535958 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1727949Z [rank3]:[W1204 13:37:25.514755504 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1728124Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1728383Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1728545Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1728924Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1729138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1729260Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1729356Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1729452Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1729454Z 2025-12-04T13:44:26.1729691Z [rank3]:[W1204 13:37:25.516712571 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1729861Z [rank2]:[W1204 13:37:25.516043466 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1730048Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1730303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1730469Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1730838Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1731040Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1731146Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1731242Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1731339Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1731341Z 2025-12-04T13:44:26.1731574Z [rank2]:[W1204 13:37:25.517328228 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1731746Z [rank1]:[W1204 13:37:26.138712535 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1731923Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1732181Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1732345Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1732721Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1732923Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1733036Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1733144Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1733240Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1733241Z 2025-12-04T13:44:26.1733475Z [rank1]:[W1204 13:37:26.140853558 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1733647Z [rank2]:[W1204 13:37:26.517421046 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1733832Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1734091Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1734255Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1734622Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1734823Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1734929Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1735027Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1735126Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1735128Z 2025-12-04T13:44:26.1735299Z [rank3]:[W1204 13:37:26.517422746 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1735472Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1735728Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1735892Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1736263Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1736465Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1736579Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1736676Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1736771Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1736773Z 2025-12-04T13:44:26.1737018Z [rank2]:[W1204 13:37:26.519446431 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1737256Z [rank3]:[W1204 13:37:26.519446751 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1737427Z [rank1]:[W1204 13:37:27.141029064 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1737648Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1737921Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1738085Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1738456Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1738662Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1738767Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1738864Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1738961Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1738964Z 2025-12-04T13:44:26.1739203Z [rank1]:[W1204 13:37:27.142872774 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1739375Z [rank3]:[W1204 13:37:27.519565669 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1739549Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1739807Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1739970Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1740340Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1740555Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1740661Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1740758Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1740854Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1740870Z 2025-12-04T13:44:26.1741115Z [rank3]:[W1204 13:37:27.520785522 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1741285Z [rank2]:[W1204 13:37:27.519565689 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1741461Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1741717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1741896Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1742264Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1742466Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1742573Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1742669Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1742768Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1742770Z 2025-12-04T13:44:26.1743006Z [rank2]:[W1204 13:37:27.521068506 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1743178Z [rank1]:[W1204 13:37:28.143024351 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1743351Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1743609Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1743773Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1744139Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1744341Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1744455Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1744552Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1744648Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1744650Z 2025-12-04T13:44:26.1744902Z [rank1]:[W1204 13:37:28.145081666 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1745084Z [rank2]:[W1204 13:37:28.521179004 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1745257Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1745516Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1745694Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1746064Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1746266Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1746372Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1746469Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1746565Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1746568Z 2025-12-04T13:44:26.1746739Z [rank3]:[W1204 13:37:28.520954369 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1746912Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1747171Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1747333Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1747754Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > 
> > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1747959Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1748063Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1748159Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1748255Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1748257Z 2025-12-04T13:44:26.1748510Z [rank2]:[W1204 13:37:28.522816078 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1748740Z [rank3]:[W1204 13:37:28.522817748 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1748925Z [rank1]:[W1204 13:37:29.145233573 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1749113Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1749369Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1749534Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1749915Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 
0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1750119Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1750223Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1750322Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1750419Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1750423Z 2025-12-04T13:44:26.1750658Z [rank1]:[W1204 13:37:29.147168861 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1750832Z [rank2]:[W1204 13:37:29.523394306 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1751007Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1751262Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1751424Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1751796Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1752001Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1752104Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1752200Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1752305Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1752307Z 2025-12-04T13:44:26.1752542Z [rank2]:[W1204 13:37:29.525077800 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1752727Z [rank3]:[W1204 13:37:29.523399296 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1752912Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1753168Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1753332Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1753699Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1753914Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1754020Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1754115Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1754213Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1754215Z 2025-12-04T13:44:26.1754448Z [rank3]:[W1204 13:37:29.525932151 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1754620Z [rank1]:[W1204 13:37:30.147334818 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1754795Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1755052Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1755214Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1755580Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1755784Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1755890Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1755986Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1756084Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1756087Z 2025-12-04T13:44:26.1756333Z [rank1]:[W1204 13:37:30.149274855 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1756504Z [rank2]:[W1204 13:37:30.525197448 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1756688Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1756954Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1757117Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1757528Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1757745Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1757849Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1757945Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1758041Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1758043Z 2025-12-04T13:44:26.1758280Z [rank2]:[W1204 13:37:30.527366440 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1758451Z [rank3]:[W1204 13:37:30.526045239 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1758626Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1758881Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1759045Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1759415Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1759617Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1759722Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1759818Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1759913Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1759915Z 2025-12-04T13:44:26.1760148Z [rank3]:[W1204 13:37:30.528194632 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1760336Z [rank1]:[W1204 13:37:31.149443783 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1760513Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1760782Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1760963Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1761332Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1761535Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1761650Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1761746Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1761842Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1761846Z 2025-12-04T13:44:26.1762078Z [rank1]:[W1204 13:37:31.150809523 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1762248Z [rank2]:[W1204 13:37:31.527441050 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1762423Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1762682Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1762845Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1763215Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1763417Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1763522Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1763618Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1763715Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1763716Z 2025-12-04T13:44:26.1763950Z [rank2]:[W1204 13:37:31.529216131 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1764132Z [rank3]:[W1204 13:37:31.528313071 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1764309Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1764574Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1764747Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1765117Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1765318Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1765434Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1765528Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1765625Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1765629Z 2025-12-04T13:44:26.1765862Z [rank3]:[W1204 13:37:31.530483533 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1766032Z [rank1]:[W1204 13:37:32.150959711 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1766207Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1766464Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1766627Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1766995Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1767200Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1767304Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1767401Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1767534Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1767538Z 2025-12-04T13:44:26.1767771Z [rank1]:[W1204 13:37:32.153569494 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1767943Z [rank2]:[W1204 13:37:32.529376619 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1768130Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1768387Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1768564Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1768947Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1769151Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1769255Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1769367Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1769463Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1769464Z 2025-12-04T13:44:26.1769636Z [rank3]:[W1204 13:37:32.530593032 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1769809Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1770066Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1770228Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1770597Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1770799Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1770903Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1770999Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1771095Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1771099Z 2025-12-04T13:44:26.1771339Z [rank2]:[W1204 13:37:32.531753427 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1771571Z [rank3]:[W1204 13:37:32.531756917 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1771744Z [rank1]:[W1204 13:37:33.153715042 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1771918Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1772183Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1772347Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1772721Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1772932Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1773037Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1773134Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1773245Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1773248Z 2025-12-04T13:44:26.1773480Z [rank1]:[W1204 13:37:33.156129519 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1773652Z [rank3]:[W1204 13:37:33.531909354 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1773826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1774082Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1774245Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1774613Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1774817Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1774921Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1775017Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1775112Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1775115Z 2025-12-04T13:44:26.1775349Z [rank3]:[W1204 13:37:33.533161837 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1775521Z [rank2]:[W1204 13:37:33.531908544 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1775697Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1775961Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1776126Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1776510Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1776720Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1776825Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1776922Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1777019Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1777031Z 2025-12-04T13:44:26.1777265Z [rank2]:[W1204 13:37:33.534068057 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1777438Z [rank1]:[W1204 13:37:34.156299096 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1777670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1777923Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1778087Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1778454Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1778656Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1778760Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1778856Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1778953Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1778955Z 2025-12-04T13:44:26.1779189Z [rank1]:[W1204 13:37:34.158722223 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1779361Z [rank3]:[W1204 13:37:34.533323915 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1779537Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1779798Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1779975Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1780355Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1780569Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1780673Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1780769Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1780864Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1780867Z 2025-12-04T13:44:26.1781103Z [rank3]:[W1204 13:37:34.534547188 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1781285Z [rank2]:[W1204 13:37:34.534177736 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1781461Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1781716Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1781882Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1782253Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1782457Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1782564Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1782658Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1782754Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1782756Z 2025-12-04T13:44:26.1782988Z [rank2]:[W1204 13:37:34.536285210 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1783160Z [rank1]:[W1204 13:37:35.158882641 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1783336Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1783591Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1783756Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1784134Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1784350Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1784466Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1784562Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1784658Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1784660Z 2025-12-04T13:44:26.1784892Z [rank1]:[W1204 13:37:35.161090982 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1785073Z [rank3]:[W1204 13:37:35.534678607 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1785248Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1785504Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1785667Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1786040Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1786245Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1786349Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1786444Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1786541Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1786543Z 2025-12-04T13:44:26.1786778Z [rank3]:[W1204 13:37:35.537203961 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1786947Z [rank2]:[W1204 13:37:35.536399179 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1787122Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1787377Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1787575Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1787958Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1788162Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1788279Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1788393Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1788490Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1788492Z 2025-12-04T13:44:26.1788724Z [rank2]:[W1204 13:37:35.538829356 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1788895Z [rank1]:[W1204 13:37:36.161256550 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1789083Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1789337Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1789502Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1789867Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1790069Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1790177Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1790276Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1790378Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1790379Z 2025-12-04T13:44:26.1790612Z [rank1]:[W1204 13:37:36.163339545 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1790784Z [rank3]:[W1204 13:37:36.537336160 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1790959Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1791218Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1791380Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1791756Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1791959Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1792064Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1792169Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1792275Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1792276Z 2025-12-04T13:44:26.1792511Z [rank3]:[W1204 13:37:36.539542651 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1792681Z [rank2]:[W1204 13:37:36.538925505 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1792861Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1793130Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1793294Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1793662Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1793864Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1793969Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1794064Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1794163Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1794165Z 2025-12-04T13:44:26.1794409Z [rank2]:[W1204 13:37:36.541299133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1794580Z [rank1]:[W1204 13:37:37.163721348 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1794756Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1795011Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1795175Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1795541Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1795755Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1795864Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1795959Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1796066Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1796079Z 2025-12-04T13:44:26.1796311Z [rank1]:[W1204 13:37:37.165954879 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1796481Z [rank3]:[W1204 13:37:37.539698380 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1796656Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1796917Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1797090Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1797460Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1797707Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1797811Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1797908Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1798004Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1798005Z 2025-12-04T13:44:26.1798241Z [rank3]:[W1204 13:37:37.541743785 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1798411Z [rank2]:[W1204 13:37:37.541411542 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1798585Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1798845Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1799011Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1799382Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1799584Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1799703Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1799799Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1799897Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1799899Z 2025-12-04T13:44:26.1800143Z [rank2]:[W1204 13:37:37.543775460 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1800325Z [rank1]:[W1204 13:37:38.166131507 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1800500Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1800755Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1800936Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1801306Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1801510Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1801617Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1801712Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1801811Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1801813Z 2025-12-04T13:44:26.1802047Z [rank1]:[W1204 13:37:38.168436117 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1802219Z [rank3]:[W1204 13:37:38.541869024 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1802393Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1802651Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1802814Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1803187Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1803392Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1803497Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1803602Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1803698Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1803701Z 2025-12-04T13:44:26.1803951Z [rank3]:[W1204 13:37:38.543558577 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1804131Z [rank2]:[W1204 13:37:38.543977248 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1804307Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1804564Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1804726Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1805105Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1805307Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1805414Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1805509Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1805608Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1805610Z 2025-12-04T13:44:26.1805845Z [rank2]:[W1204 13:37:38.546306317 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1806016Z [rank1]:[W1204 13:37:39.168624184 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1806193Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1806449Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1806613Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1806981Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1807184Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1807288Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1807384Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1807539Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1807541Z 2025-12-04T13:44:26.1807777Z [rank1]:[W1204 13:37:39.170798067 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1807962Z [rank3]:[W1204 13:37:39.543726805 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1808150Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1808407Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1808571Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1808941Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1809158Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1809263Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1809359Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1809454Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1809456Z 2025-12-04T13:44:26.1809690Z [rank3]:[W1204 13:37:39.546275810 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1809863Z [rank2]:[W1204 13:37:39.546425676 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1810042Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1810299Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1810462Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1810829Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1811032Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1811138Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1811233Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1811329Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1811331Z 2025-12-04T13:44:26.1811574Z [rank2]:[W1204 13:37:39.548435912 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1811746Z [rank1]:[W1204 13:37:40.170990805 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1811931Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1812198Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1812360Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1812726Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1812938Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1813044Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1813138Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1813236Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1813238Z 2025-12-04T13:44:26.1813472Z [rank1]:[W1204 13:37:40.173127068 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1813665Z [rank3]:[W1204 13:37:40.546440148 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1814091Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1814580Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1815058Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1815628Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1816237Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1816592Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1816838Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1817084Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1817226Z 2025-12-04T13:44:26.1817570Z [rank3]:[W1204 13:37:40.549090480 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1818088Z [rank2]:[W1204 13:37:40.548559092 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1818550Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1819050Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1819516Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1820090Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1820714Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1821064Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1821306Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1821539Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1821676Z 2025-12-04T13:44:26.1821912Z [rank2]:[W1204 13:37:40.550954799 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1822355Z [rank1]:[W1204 13:37:41.173315086 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1822734Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1823200Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1823655Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1824219Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1824820Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1825163Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1825401Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1825632Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1825770Z 2025-12-04T13:44:26.1826003Z [rank1]:[W1204 13:37:41.175781872 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1826459Z [rank3]:[W1204 13:37:41.549270438 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1826834Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1827309Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1827829Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1828394Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1828999Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1829357Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1829594Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1829824Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1829959Z 2025-12-04T13:44:26.1830198Z [rank3]:[W1204 13:37:41.551223405 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1830634Z [rank2]:[W1204 13:37:41.551089858 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1831008Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1831474Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1831924Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1832488Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1833091Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1833434Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1833674Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1833907Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1834046Z 2025-12-04T13:44:26.1834283Z [rank2]:[W1204 13:37:41.553381048 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1834718Z [rank1]:[W1204 13:37:42.175963320 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1835107Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1835568Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1836035Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1836613Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1837211Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1837597Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1837850Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1838081Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1838218Z 2025-12-04T13:44:26.1838452Z [rank1]:[W1204 13:37:42.178486475 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1838890Z [rank3]:[W1204 13:37:42.551361605 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1839267Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1839730Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1840185Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1840749Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1841350Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1841691Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1841930Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1842163Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1842300Z 2025-12-04T13:44:26.1842535Z [rank3]:[W1204 13:37:42.553542547 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1842973Z [rank2]:[W1204 13:37:42.553479648 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1843348Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1843824Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1844278Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1844857Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1845470Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1845814Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1846051Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1846293Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1846431Z 2025-12-04T13:44:26.1846667Z [rank2]:[W1204 13:37:42.556693688 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1847106Z [rank1]:[W1204 13:37:43.178680403 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1847528Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1847988Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1848438Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1849003Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1849600Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1849941Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1850177Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1850410Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1850547Z 2025-12-04T13:44:26.1850781Z [rank1]:[W1204 13:37:43.180769927 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1851217Z [rank3]:[W1204 13:37:43.553671457 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1851595Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1852077Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1852528Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1853106Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1853725Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1854065Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1854304Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1854533Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1854688Z 2025-12-04T13:44:26.1854923Z [rank3]:[W1204 13:37:43.555984586 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1855358Z [rank2]:[W1204 13:37:43.556810198 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1860939Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1861425Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1861886Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1862462Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1863067Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1863414Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1863662Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1863898Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1864036Z 2025-12-04T13:44:26.1864276Z [rank2]:[W1204 13:37:43.559194456 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1864721Z [rank1]:[W1204 13:37:44.180943676 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1865104Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1865576Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1866073Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1866656Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1867271Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1867654Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1867893Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1868123Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1868260Z 2025-12-04T13:44:26.1868499Z [rank1]:[W1204 13:37:44.182893903 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1868957Z [rank3]:[W1204 13:37:44.556118896 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1869337Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1869798Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1870251Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1870818Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1871421Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1871763Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1872001Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1872233Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1872371Z 2025-12-04T13:44:26.1872606Z [rank3]:[W1204 13:37:44.558333607 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1873042Z [rank2]:[W1204 13:37:44.559297926 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1873421Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1873885Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1874338Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1874920Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1875533Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1875887Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1876125Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1876355Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1876490Z 2025-12-04T13:44:26.1876727Z [rank2]:[W1204 13:37:44.561702333 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1877174Z [rank1]:[W1204 13:37:45.183056993 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1877576Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1878039Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1878488Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1879052Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1879652Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1879999Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1880242Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1880483Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1880620Z 2025-12-04T13:44:26.1880862Z [rank1]:[W1204 13:37:45.184902612 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1881308Z [rank3]:[W1204 13:37:45.558446828 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1881690Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1882165Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1882627Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1883210Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1883811Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1884178Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1884438Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1884681Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1884826Z 2025-12-04T13:44:26.1885065Z [rank3]:[W1204 13:37:45.559915075 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1885503Z [rank2]:[W1204 13:37:45.561785554 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1885901Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1886371Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1886837Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1887412Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1888060Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1888402Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1888650Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1888886Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1889026Z 2025-12-04T13:44:26.1889270Z [rank2]:[W1204 13:37:45.562890220 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1889715Z [rank1]:[W1204 13:37:46.185067721 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1890099Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1890574Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1891037Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1891624Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1892231Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1892578Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1892830Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1893081Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1893216Z 2025-12-04T13:44:26.1893451Z [rank1]:[W1204 13:37:46.186601158 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1893886Z [rank3]:[W1204 13:37:46.560110564 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1894262Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1894738Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1895192Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1895759Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1896354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1896694Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1896932Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1897162Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1897299Z 2025-12-04T13:44:26.1897569Z [rank3]:[W1204 13:37:46.562289196 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1898008Z [rank2]:[W1204 13:37:46.562985981 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1898384Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1898845Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1899297Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1899863Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1900477Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1900819Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1901057Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1901302Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1901451Z 2025-12-04T13:44:26.1901686Z [rank2]:[W1204 13:37:46.565353739 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1902122Z [rank1]:[W1204 13:37:47.186691329 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1902498Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1902957Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1903426Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1903988Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1904588Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1904928Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1905167Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1905396Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1905533Z 2025-12-04T13:44:26.1905769Z [rank1]:[W1204 13:37:47.188104328 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1906205Z [rank3]:[W1204 13:37:47.562469986 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1906584Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1907043Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1907536Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1908099Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1908713Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1909053Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1909292Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1909521Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1909657Z 2025-12-04T13:44:26.1909904Z [rank3]:[W1204 13:37:47.564848313 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1910351Z [rank2]:[W1204 13:37:47.565466530 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1910728Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1911191Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1911659Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1912223Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1912822Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1913165Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1913401Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1913635Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1913772Z 2025-12-04T13:44:26.1914007Z [rank2]:[W1204 13:37:47.567563514 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1914442Z [rank1]:[W1204 13:37:48.188237048 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1914817Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1915277Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1915728Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1916294Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1916894Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1917233Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1917529Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1917758Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1917894Z 2025-12-04T13:44:26.1918143Z [rank1]:[W1204 13:37:48.189498651 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1918590Z [rank3]:[W1204 13:37:48.565065452 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1918965Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1919427Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1919881Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1920460Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1921064Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1921404Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1921643Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1921872Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1922009Z 2025-12-04T13:44:26.1922244Z [rank3]:[W1204 13:37:48.567154926 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1922679Z [rank2]:[W1204 13:37:48.567701764 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1923054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1923519Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1923972Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1924540Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1925137Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1925476Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1925713Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1925953Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1926091Z 2025-12-04T13:44:26.1926326Z [rank2]:[W1204 13:37:48.569955864 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1926780Z [rank1]:[W1204 13:37:49.189639871 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1927166Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1927664Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1928113Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1928677Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1929291Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1929631Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1929871Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1930102Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1930237Z 2025-12-04T13:44:26.1930476Z [rank1]:[W1204 13:37:49.190873894 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1930915Z [rank3]:[W1204 13:37:49.567612909 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1931294Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1931756Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1932208Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1932770Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1933373Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1933716Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1933955Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1934185Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1934321Z 2025-12-04T13:44:26.1934572Z [rank3]:[W1204 13:37:49.570012407 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1935009Z [rank2]:[W1204 13:37:49.570039456 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1935404Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1935879Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1936329Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1936893Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1937541Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1937884Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1938122Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1938352Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1938489Z 2025-12-04T13:44:26.1938726Z [rank2]:[W1204 13:37:49.572417424 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1939162Z [rank1]:[W1204 13:37:50.191027894 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1939540Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1940000Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1940450Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1941011Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1941608Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1941949Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1942186Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1942416Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1942551Z 2025-12-04T13:44:26.1942804Z [rank1]:[W1204 13:37:50.192279016 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1943239Z [rank3]:[W1204 13:37:50.570136367 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1943634Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1944109Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1944562Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1945126Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1945741Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1946083Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1946322Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1946551Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1946686Z 2025-12-04T13:44:26.1946922Z [rank3]:[W1204 13:37:50.572057695 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1947356Z [rank2]:[W1204 13:37:50.572503926 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1947775Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1948235Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1948689Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1949255Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1949856Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1950199Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1950438Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1950668Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1950805Z 2025-12-04T13:44:26.1951039Z [rank2]:[W1204 13:37:50.573780527 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1951491Z [rank1]:[W1204 13:37:51.192700161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1951872Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1952350Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1952812Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1953375Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1953976Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1954330Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1954567Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1954797Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1954933Z 2025-12-04T13:44:26.1955166Z [rank1]:[W1204 13:37:51.194260006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1955602Z [rank3]:[W1204 13:37:51.572255905 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1955979Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1956441Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1956892Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1957454Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1958109Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1958450Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1958689Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1958925Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1959060Z 2025-12-04T13:44:26.1959297Z [rank3]:[W1204 13:37:51.574287170 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1959755Z [rank2]:[W1204 13:37:51.573913148 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1960135Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1960597Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1961063Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1961646Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1962246Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1962602Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1962840Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1963073Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1963211Z 2025-12-04T13:44:26.1963446Z [rank2]:[W1204 13:37:51.575755398 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1963885Z [rank1]:[W1204 13:37:52.194420807 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1964262Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1964724Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1965174Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1965736Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1966338Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1966678Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1966918Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1967147Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1967284Z 2025-12-04T13:44:26.1967566Z [rank1]:[W1204 13:37:52.195720018 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1968005Z [rank3]:[W1204 13:37:52.574447170 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1968396Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.1968860Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1969326Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1969901Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1970499Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1970841Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1971092Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1971323Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1971458Z 2025-12-04T13:44:26.1971695Z [rank3]:[W1204 13:37:52.576646672 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1972131Z [rank2]:[W1204 13:37:52.575887969 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1972507Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1972970Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1973427Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1973994Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1974596Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1974940Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1975177Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1975408Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1975545Z 2025-12-04T13:44:26.1975779Z [rank2]:[W1204 13:37:52.578257917 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1976217Z [rank1]:[W1204 13:37:53.195868339 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1976596Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1977071Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1977633Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1978226Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1978845Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1979187Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1979425Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1979656Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1979818Z 2025-12-04T13:44:26.1980052Z [rank1]:[W1204 13:37:53.197133861 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1980488Z [rank3]:[W1204 13:37:53.576766363 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1980864Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1981326Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1981780Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1982351Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1982949Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1983293Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1983530Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1983759Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1983895Z 2025-12-04T13:44:26.1984131Z [rank3]:[W1204 13:37:53.579138371 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1984568Z [rank2]:[W1204 13:37:53.578360508 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1984944Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1985423Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1985877Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1986457Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1987068Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1987409Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1987686Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1987916Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1988053Z 2025-12-04T13:44:26.1988287Z [rank2]:[W1204 13:37:53.580655448 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1988752Z [rank1]:[W1204 13:37:54.197284242 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1989128Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1989591Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1990043Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1990605Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1991206Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1991548Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1991785Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1992019Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1992155Z 2025-12-04T13:44:26.1992390Z [rank1]:[W1204 13:37:54.198538164 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1992831Z [rank3]:[W1204 13:37:54.579248813 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1993207Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1993668Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1994136Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1994701Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1995313Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1995668Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.1995905Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1996136Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.1996271Z 2025-12-04T13:44:26.1996508Z [rank3]:[W1204 13:37:54.581660360 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.1996959Z [rank2]:[W1204 13:37:54.580753160 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.1997339Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.1997860Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.1998311Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1998880Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.1999485Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.1999830Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2000068Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2000298Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2000434Z 2025-12-04T13:44:26.2000673Z [rank2]:[W1204 13:37:54.582688917 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2001111Z [rank1]:[W1204 13:37:55.198690245 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2001492Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2001956Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2002407Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2002987Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2003598Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2003953Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2004192Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2004422Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2004560Z 2025-12-04T13:44:26.2004795Z [rank1]:[W1204 13:37:55.200284740 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2005229Z [rank3]:[W1204 13:37:55.581776312 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2005623Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2006092Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2006544Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2007114Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2007761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2008104Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2008342Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2008572Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2008707Z 2025-12-04T13:44:26.2008943Z [rank3]:[W1204 13:37:55.583978043 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2009380Z [rank2]:[W1204 13:37:55.582797769 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2009761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2010223Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2010674Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2011251Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2011856Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2012219Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2012470Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2012700Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2012836Z 2025-12-04T13:44:26.2013070Z [rank2]:[W1204 13:37:55.584982181 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2013507Z [rank1]:[W1204 13:37:56.200432461 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2013901Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2014359Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2014811Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2015375Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2015973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2016319Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2016558Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2016789Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2016925Z 2025-12-04T13:44:26.2017160Z [rank1]:[W1204 13:37:56.201998626 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2017636Z [rank3]:[W1204 13:37:56.584121544 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2018018Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2018482Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2018933Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2019497Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2020110Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2020454Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2020705Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2020956Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2021092Z 2025-12-04T13:44:26.2021325Z [rank3]:[W1204 13:37:56.586266787 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2021759Z [rank2]:[W1204 13:37:56.585122812 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2022136Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2022614Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2023066Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2023630Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2024233Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2024575Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2024812Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2025046Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2025183Z 2025-12-04T13:44:26.2025419Z [rank2]:[W1204 13:37:56.587507110 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2025856Z [rank1]:[W1204 13:37:57.202090419 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2026234Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2026490Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2026654Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2027022Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2027232Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2027339Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2027436Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2027573Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2027590Z 2025-12-04T13:44:26.2027836Z [rank1]:[W1204 13:37:57.203232764 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2028010Z [rank3]:[W1204 13:37:57.586330140 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2028188Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2028444Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2028625Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2028991Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2029193Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2029298Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2029394Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2029491Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2029494Z 2025-12-04T13:44:26.2029726Z [rank3]:[W1204 13:37:57.588465723 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2029897Z [rank2]:[W1204 13:37:57.587605302 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2030072Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2030334Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2030497Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2030864Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2031068Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2031183Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2031280Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2031377Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2031379Z 2025-12-04T13:44:26.2031622Z [rank2]:[W1204 13:37:57.589955431 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2031801Z [rank1]:[W1204 13:37:58.203360115 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2031976Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2032231Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2032407Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2032773Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2032974Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2033079Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2033175Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2033272Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2033276Z 2025-12-04T13:44:26.2033509Z [rank1]:[W1204 13:37:58.204600828 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2033679Z [rank3]:[W1204 13:37:58.588617045 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2033854Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2034108Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2034270Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2034642Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2034846Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2034950Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2035054Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2035152Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2035154Z 2025-12-04T13:44:26.2035388Z [rank3]:[W1204 13:37:58.591142129 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2035567Z [rank2]:[W1204 13:37:58.590080833 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2035752Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2036007Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2036169Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2036547Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2036752Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2036856Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2036952Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2037049Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2037051Z 2025-12-04T13:44:26.2037286Z [rank2]:[W1204 13:37:58.592179697 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2037458Z [rank1]:[W1204 13:37:59.204694261 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2037688Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2037942Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2038108Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2038479Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2038683Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2038788Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2038882Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2038995Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2038998Z 2025-12-04T13:44:26.2039232Z [rank1]:[W1204 13:37:59.205836716 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2039423Z [rank3]:[W1204 13:37:59.591277791 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2039609Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2039864Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2040029Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2040399Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2040616Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2040723Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2040818Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2040915Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2040917Z 2025-12-04T13:44:26.2041155Z [rank3]:[W1204 13:37:59.593915383 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2041328Z [rank2]:[W1204 13:37:59.592270649 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2041503Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2041761Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2041923Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2042290Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2042495Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2042600Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2042696Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2042791Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2042793Z 2025-12-04T13:44:26.2043046Z [rank2]:[W1204 13:37:59.594666917 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2043217Z [rank1]:[W1204 13:38:00.205944158 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2043403Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2043671Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2043832Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2044199Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2044412Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2044518Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2044612Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2044708Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2044710Z 2025-12-04T13:44:26.2044945Z [rank1]:[W1204 13:38:00.207416216 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2045114Z [rank3]:[W1204 13:38:00.594012246 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2045289Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2045547Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2045711Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2046075Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2046277Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2046382Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2046478Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2046576Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2046578Z 2025-12-04T13:44:26.2046811Z [rank3]:[W1204 13:38:00.596042851 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2046990Z [rank2]:[W1204 13:38:00.594760389 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2047165Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2047432Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2047630Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2047999Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2048218Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2048322Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2048419Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2048515Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2048517Z 2025-12-04T13:44:26.2048751Z [rank2]:[W1204 13:38:00.597121028 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2048921Z [rank1]:[W1204 13:38:01.207535398 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2049097Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2049352Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2049514Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2049883Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2050085Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2050190Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2050285Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2050383Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2050385Z 2025-12-04T13:44:26.2050618Z [rank1]:[W1204 13:38:01.208792481 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2050800Z [rank3]:[W1204 13:38:01.596185773 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2050977Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2051244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2051421Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2051786Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2051991Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2052111Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2052206Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2052303Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2052306Z 2025-12-04T13:44:26.2052538Z [rank3]:[W1204 13:38:01.598342626 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2052709Z [rank2]:[W1204 13:38:01.597279889 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2052883Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2053138Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2053301Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2053666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2053869Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2053972Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2054069Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2054166Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2054169Z 2025-12-04T13:44:26.2054404Z [rank2]:[W1204 13:38:01.599495800 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2054574Z [rank1]:[W1204 13:38:02.208928233 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2054765Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2055020Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2055192Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2055568Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2055769Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2055874Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2055978Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2056074Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2056076Z 2025-12-04T13:44:26.2056310Z [rank1]:[W1204 13:38:02.210145486 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2056481Z [rank3]:[W1204 13:38:02.598547306 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2056657Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2056912Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2057076Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2057441Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2057699Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2057804Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2057899Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2057996Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2057998Z 2025-12-04T13:44:26.2058231Z [rank3]:[W1204 13:38:02.600511683 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2058403Z [rank2]:[W1204 13:38:02.599584084 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2058578Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2058850Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2059019Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2059397Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2059612Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2059718Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2059816Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2059923Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2059925Z 2025-12-04T13:44:26.2060160Z [rank2]:[W1204 13:38:02.601998001 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2060330Z [rank1]:[W1204 13:38:03.210266448 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2060504Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2060762Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2060926Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2061293Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2061495Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2061600Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2061695Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2061792Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2061795Z 2025-12-04T13:44:26.2062029Z [rank1]:[W1204 13:38:03.212025490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2062198Z [rank3]:[W1204 13:38:03.600909800 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2062372Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2062636Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2062803Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2063183Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2063395Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2063500Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2063596Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2063693Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2063705Z 2025-12-04T13:44:26.2063937Z [rank3]:[W1204 13:38:03.603191020 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2064110Z [rank2]:[W1204 13:38:03.602098794 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2064284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2064541Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2064705Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2065073Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2065280Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2065384Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2065479Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2065575Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2065580Z 2025-12-04T13:44:26.2065813Z [rank2]:[W1204 13:38:03.604413433 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2065985Z [rank1]:[W1204 13:38:04.212126763 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2066159Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2066414Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2066584Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2066962Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2067174Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2067280Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2067376Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2067512Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2067517Z 2025-12-04T13:44:26.2067752Z [rank1]:[W1204 13:38:04.213601870 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2067946Z [rank3]:[W1204 13:38:04.603293403 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2068122Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2068377Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2068542Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2068907Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2069111Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2069216Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2069311Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2069409Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2069411Z 2025-12-04T13:44:26.2069648Z [rank3]:[W1204 13:38:04.605601862 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2069822Z [rank2]:[W1204 13:38:04.604470127 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2069996Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2070252Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2070415Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2070796Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2071013Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2071128Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2071224Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2071321Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2071323Z 2025-12-04T13:44:26.2071557Z [rank2]:[W1204 13:38:04.606872284 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2071741Z [rank1]:[W1204 13:38:05.213735563 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2071915Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2072172Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2072334Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2072702Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2072905Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2073010Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2073105Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2073200Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2073202Z 2025-12-04T13:44:26.2073438Z [rank1]:[W1204 13:38:05.215242330 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2073609Z [rank3]:[W1204 13:38:05.605725205 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2073785Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2074042Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2074205Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2074583Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2074784Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2074901Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2075005Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2075100Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2075102Z 2025-12-04T13:44:26.2075336Z [rank3]:[W1204 13:38:05.607868968 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2075508Z [rank2]:[W1204 13:38:05.606937228 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2075693Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2075948Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2076112Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2076478Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2076683Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2076793Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2076891Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2076988Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2076989Z 2025-12-04T13:44:26.2077223Z [rank2]:[W1204 13:38:05.610348763 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2077394Z [rank1]:[W1204 13:38:06.215382252 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2077593Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2077848Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2078012Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2078405Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2078606Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2078715Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2078828Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2078940Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2078941Z 2025-12-04T13:44:26.2079173Z [rank1]:[W1204 13:38:06.216625125 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2079343Z [rank3]:[W1204 13:38:06.607985301 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2079518Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2079792Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2079955Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2080320Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2080521Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2080627Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2080721Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2080819Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2080822Z 2025-12-04T13:44:26.2081053Z [rank3]:[W1204 13:38:06.610560904 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2081224Z [rank2]:[W1204 13:38:06.610410098 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2081398Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2081652Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2081816Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2082180Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2082394Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2082501Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2082599Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2082706Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2082725Z 2025-12-04T13:44:26.2082958Z [rank2]:[W1204 13:38:06.612819375 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2083129Z [rank1]:[W1204 13:38:07.216795517 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2083304Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2083560Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2083734Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2084100Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2084302Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2084406Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2084505Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2084600Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2084602Z 2025-12-04T13:44:26.2084838Z [rank1]:[W1204 13:38:07.218834632 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2085010Z [rank3]:[W1204 13:38:07.610717847 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2085184Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2085437Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2085600Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2085966Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2086169Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2086283Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2086378Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2086477Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2086479Z 2025-12-04T13:44:26.2086722Z [rank3]:[W1204 13:38:07.613073465 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2086906Z [rank2]:[W1204 13:38:07.612905959 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2087078Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2087334Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2087553Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2087918Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2088121Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2088226Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2088321Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2088419Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2088423Z 2025-12-04T13:44:26.2088655Z [rank2]:[W1204 13:38:07.615353545 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2088826Z [rank1]:[W1204 13:38:08.219039144 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2089001Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2089256Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2089417Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2089785Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2089987Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2090091Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2090207Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2090303Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2090306Z 2025-12-04T13:44:26.2090567Z [rank1]:[W1204 13:38:08.221036380 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2090753Z [rank3]:[W1204 13:38:08.613238547 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2090929Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2091184Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2091348Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2091729Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2091932Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2092036Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2092130Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2092227Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2092230Z 2025-12-04T13:44:26.2092463Z [rank3]:[W1204 13:38:08.615692583 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2092634Z [rank2]:[W1204 13:38:08.615452279 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2092810Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2093066Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2093230Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2093597Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2093801Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2093905Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2094001Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2094107Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2094110Z 2025-12-04T13:44:26.2094343Z [rank2]:[W1204 13:38:08.617818757 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2094524Z [rank1]:[W1204 13:38:09.221228091 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2094707Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2094960Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2095123Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2095492Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2095707Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2095811Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2095907Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2096002Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2096006Z 2025-12-04T13:44:26.2096239Z [rank1]:[W1204 13:38:09.223167499 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2096410Z [rank3]:[W1204 13:38:09.615849836 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2096585Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2096839Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2097004Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2097374Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2097611Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2097717Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2097811Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2097907Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2097909Z 2025-12-04T13:44:26.2098154Z [rank3]:[W1204 13:38:09.617225906 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2098328Z [rank2]:[W1204 13:38:09.617936260 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2098522Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2098790Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2098954Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2099324Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2099541Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2099646Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2099742Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2099837Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2099840Z 2025-12-04T13:44:26.2100072Z [rank2]:[W1204 13:38:09.619393318 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2100243Z [rank1]:[W1204 13:38:10.223314862 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2100418Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2100673Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2100834Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2101198Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2101401Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2101506Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2101602Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2101697Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2101699Z 2025-12-04T13:44:26.2101942Z [rank1]:[W1204 13:38:10.224681862 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2102113Z [rank3]:[W1204 13:38:10.617422037 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2102288Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2102552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2102725Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2103090Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2103301Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2103407Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2103504Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2103602Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2103603Z 2025-12-04T13:44:26.2103836Z [rank3]:[W1204 13:38:10.619224628 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2104008Z [rank2]:[W1204 13:38:10.619500172 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2104184Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2104437Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2104602Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2104967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2105171Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2105276Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2105373Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2105468Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2105472Z 2025-12-04T13:44:26.2105707Z [rank2]:[W1204 13:38:10.620667886 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2105887Z [rank1]:[W1204 13:38:11.224850994 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2106061Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2106330Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2106502Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2106867Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2107070Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2107189Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2107287Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2107383Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2107384Z 2025-12-04T13:44:26.2107656Z [rank1]:[W1204 13:38:11.226771082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2107828Z [rank3]:[W1204 13:38:11.619415220 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2108006Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2108262Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2108425Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2108791Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2108994Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2109100Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2109195Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2109293Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2109295Z 2025-12-04T13:44:26.2109526Z [rank3]:[W1204 13:38:11.621159592 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2109696Z [rank2]:[W1204 13:38:11.620754910 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2109884Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2110141Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2110318Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2110695Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2110898Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2111001Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2111112Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2111209Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2111212Z 2025-12-04T13:44:26.2111443Z [rank2]:[W1204 13:38:11.623085179 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2111614Z [rank1]:[W1204 13:38:12.226935625 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2111788Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2112043Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2112207Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2112577Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2112780Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2112885Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2112982Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2113076Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2113078Z 2025-12-04T13:44:26.2113312Z [rank1]:[W1204 13:38:12.228954940 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2113483Z [rank3]:[W1204 13:38:12.621331664 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2113658Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2113932Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2114096Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2114474Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2114687Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2114793Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2114887Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2114998Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2115000Z 2025-12-04T13:44:26.2115233Z [rank3]:[W1204 13:38:12.623064866 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2115406Z [rank2]:[W1204 13:38:12.623194253 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2115581Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2115834Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2115998Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2116364Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2116568Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2116673Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2116769Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2116867Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2116869Z 2025-12-04T13:44:26.2117102Z [rank2]:[W1204 13:38:12.624357588 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2117275Z [rank1]:[W1204 13:38:13.229103134 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2117447Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2117748Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2117910Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2118293Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2118507Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2118611Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2118708Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2118803Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2118818Z 2025-12-04T13:44:26.2119054Z [rank1]:[W1204 13:38:13.230634060 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2119225Z [rank3]:[W1204 13:38:13.623235079 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2119400Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2119657Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2119820Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2120189Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2120389Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2120494Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2120588Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2120685Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2120687Z 2025-12-04T13:44:26.2120924Z [rank3]:[W1204 13:38:13.624695857 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2121099Z [rank2]:[W1204 13:38:13.624423883 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2121274Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2121527Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2121701Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2122077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2122293Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2122396Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2122491Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2122588Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2122590Z 2025-12-04T13:44:26.2122821Z [rank2]:[W1204 13:38:13.625563878 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2123006Z [rank1]:[W1204 13:38:14.230795623 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2123181Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2123439Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2123601Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2123969Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2124172Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2124276Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2124370Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2124465Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2124468Z 2025-12-04T13:44:26.2129298Z [rank1]:[W1204 13:38:14.233338607 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2129482Z [rank3]:[W1204 13:38:14.624871909 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2129662Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2129927Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2130094Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2130499Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2130716Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2130848Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2130944Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2131039Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2131042Z 2025-12-04T13:44:26.2131277Z [rank3]:[W1204 13:38:14.626121812 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2131463Z [rank2]:[W1204 13:38:14.625669342 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2131640Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2131894Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2132057Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2132426Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2132630Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2132736Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2132832Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2132929Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2132931Z 2025-12-04T13:44:26.2133164Z [rank2]:[W1204 13:38:14.626826766 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2133334Z [rank1]:[W1204 13:38:15.233492100 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2133510Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2133764Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2133926Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2134302Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2134506Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2134619Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2134725Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2134821Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2134823Z 2025-12-04T13:44:26.2135058Z [rank1]:[W1204 13:38:15.234727093 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2135228Z [rank3]:[W1204 13:38:15.626294085 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2135417Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2135672Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2135835Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2136205Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2136406Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2136511Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2136607Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2136703Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2136705Z 2025-12-04T13:44:26.2136937Z [rank3]:[W1204 13:38:15.627553067 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2137108Z [rank2]:[W1204 13:38:15.626999259 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2137284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2137574Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2137738Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2138117Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2138320Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2138426Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2138533Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2138643Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2138645Z 2025-12-04T13:44:26.2138880Z [rank2]:[W1204 13:38:15.628385439 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2139053Z [rank1]:[W1204 13:38:16.234889797 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2139227Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2139496Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2139658Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2140025Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2140228Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2140333Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2140429Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2140525Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2140527Z 2025-12-04T13:44:26.2140760Z [rank1]:[W1204 13:38:16.236136890 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2140932Z [rank3]:[W1204 13:38:16.627695371 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2141109Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2141365Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2141527Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2141893Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2142104Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2142211Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2142307Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2142414Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2142425Z 2025-12-04T13:44:26.2142659Z [rank3]:[W1204 13:38:16.628883185 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2142828Z [rank2]:[W1204 13:38:16.628497124 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2143004Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2143262Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2143436Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2143802Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2144005Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2144110Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2144206Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2144304Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2144306Z 2025-12-04T13:44:26.2144536Z [rank2]:[W1204 13:38:16.630065949 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2144707Z [rank1]:[W1204 13:38:17.236282233 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2144881Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2145136Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2145303Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2145671Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2145884Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2145988Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2146084Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2146179Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2146181Z 2025-12-04T13:44:26.2146428Z [rank1]:[W1204 13:38:17.238474855 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2146608Z [rank3]:[W1204 13:38:17.629487869 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2146783Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2147039Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2147212Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2147615Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2147817Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2147923Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2148018Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2148114Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2148116Z 2025-12-04T13:44:26.2148349Z [rank3]:[W1204 13:38:17.631628862 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2148519Z [rank2]:[W1204 13:38:17.630227823 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2148694Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2148949Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2149114Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2149478Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2149684Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2149788Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2149897Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2149994Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2149997Z 2025-12-04T13:44:26.2150241Z [rank2]:[W1204 13:38:17.631972295 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2150426Z [rank1]:[W1204 13:38:18.238596949 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2150599Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2150859Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2151020Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2151402Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2151604Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2151706Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2151802Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2151897Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2151900Z 2025-12-04T13:44:26.2152135Z [rank1]:[W1204 13:38:18.240210484 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2152304Z [rank3]:[W1204 13:38:18.631770586 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2152479Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2152735Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2152896Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2153264Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2153467Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2153572Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2153666Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2153772Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2153774Z 2025-12-04T13:44:26.2154009Z [rank3]:[W1204 13:38:18.633366211 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2154189Z [rank2]:[W1204 13:38:18.632059410 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2154375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2154628Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2154792Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2155168Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2155371Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2155475Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2155569Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2155666Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2155668Z 2025-12-04T13:44:26.2155901Z [rank2]:[W1204 13:38:18.634079045 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2156073Z [rank1]:[W1204 13:38:19.240347138 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2156251Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2156507Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2156668Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2157034Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2157237Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2157341Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2157436Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2157563Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2157565Z 2025-12-04T13:44:26.2157812Z [rank1]:[W1204 13:38:19.242759725 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2157983Z [rank3]:[W1204 13:38:19.633502425 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2158177Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2158451Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2158615Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2158988Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2159205Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2159310Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2159404Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2159499Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2159501Z 2025-12-04T13:44:26.2159735Z [rank3]:[W1204 13:38:19.635555490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2159906Z [rank2]:[W1204 13:38:19.634166981 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2160082Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2160336Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2160500Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2160870Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2161073Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2161178Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2161273Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2161370Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2161373Z 2025-12-04T13:44:26.2161621Z [rank2]:[W1204 13:38:19.636704415 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2161791Z [rank1]:[W1204 13:38:20.242897370 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2161977Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2162244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2162406Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2162773Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2162987Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2163091Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2163186Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2163281Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2163283Z 2025-12-04T13:44:26.2163518Z [rank1]:[W1204 13:38:20.245172590 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2163688Z [rank3]:[W1204 13:38:20.635742403 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2163865Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2164121Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2164284Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2164651Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2164852Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2164959Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2165055Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2166863Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2166865Z 2025-12-04T13:44:26.2167098Z [rank3]:[W1204 13:38:20.637750019 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2168029Z [rank2]:[W1204 13:38:20.636823240 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2168204Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2168479Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2168643Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2169033Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2169238Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2169343Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2169440Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2169537Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2169541Z 2025-12-04T13:44:26.2169776Z [rank2]:[W1204 13:38:20.638328817 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2169946Z [rank1]:[W1204 13:38:21.245290364 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2170123Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2170377Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2170539Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2170906Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2171108Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2171213Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2171308Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2171404Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2171406Z 2025-12-04T13:44:26.2171695Z [rank1]:[W1204 13:38:21.247282091 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2171878Z [rank3]:[W1204 13:38:21.637855845 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2172069Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2172328Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2172503Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2172872Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2173074Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2173178Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2173275Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2173371Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2173372Z 2025-12-04T13:44:26.2173543Z [rank2]:[W1204 13:38:21.638414062 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2173716Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2173970Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2174137Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2174503Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2174706Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2174808Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2174905Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2175002Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2175004Z 2025-12-04T13:44:26.2175239Z [rank2]:[W1204 13:38:21.639948169 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2175467Z [rank3]:[W1204 13:38:21.639949099 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2175653Z [rank1]:[W1204 13:38:22.247403126 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2175843Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2176110Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2176288Z frame #1: + 
0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2176656Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2176859Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2176965Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2177060Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2177155Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2177157Z 2025-12-04T13:44:26.2177390Z [rank1]:[W1204 13:38:22.249445381 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2177602Z [rank2]:[W1204 13:38:22.640079163 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2177777Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2178034Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2178196Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2178569Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 
(0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2178775Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2178880Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2178977Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2179072Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2179074Z 2025-12-04T13:44:26.2179308Z [rank2]:[W1204 13:38:22.641542961 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2179478Z [rank3]:[W1204 13:38:22.640079193 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2179666Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2179935Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2180114Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2180494Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2180697Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2180804Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2180900Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2180997Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2180999Z 2025-12-04T13:44:26.2181233Z [rank3]:[W1204 13:38:22.641884234 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2181403Z [rank1]:[W1204 13:38:23.249570356 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2181578Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2181832Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2181995Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2182362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2182564Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2182668Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2182763Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2182860Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2182862Z 2025-12-04T13:44:26.2183094Z [rank1]:[W1204 13:38:23.251848976 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2183266Z [rank2]:[W1204 13:38:23.641699325 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2183457Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2183723Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2183900Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2184279Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2184481Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2184585Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2184681Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2184779Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2184781Z 2025-12-04T13:44:26.2185015Z [rank2]:[W1204 13:38:23.643781760 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2185187Z [rank3]:[W1204 13:38:23.642019698 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2185360Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2185615Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2185778Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2186146Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2186348Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2186453Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2186549Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2186644Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2186647Z 2025-12-04T13:44:26.2186880Z [rank3]:[W1204 13:38:23.644135742 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2187051Z [rank1]:[W1204 13:38:24.251976361 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2187225Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2187535Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2187711Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2188092Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2188306Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2188412Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2188506Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2188602Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2188604Z 2025-12-04T13:44:26.2188839Z [rank1]:[W1204 13:38:24.254561734 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2189009Z [rank3]:[W1204 13:38:24.644242178 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2189184Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2189441Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2189607Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2189973Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2190177Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2190280Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2190377Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2190474Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2190476Z 2025-12-04T13:44:26.2190710Z [rank3]:[W1204 13:38:24.646041238 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2190881Z [rank2]:[W1204 13:38:24.643939314 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2191054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2191309Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2191484Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2191866Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2192094Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2192199Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2192296Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2192391Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2192393Z 2025-12-04T13:44:26.2192626Z [rank2]:[W1204 13:38:24.646485698 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2192796Z [rank1]:[W1204 13:38:25.254723228 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2192974Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2193228Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2193391Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2193757Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2193963Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2194069Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2194164Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2194261Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2194263Z 2025-12-04T13:44:26.2194495Z [rank1]:[W1204 13:38:25.256628667 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2194665Z [rank3]:[W1204 13:38:25.646190083 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2194840Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2195094Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2195270Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2195646Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2195859Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2195980Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2196076Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2196172Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2196175Z 2025-12-04T13:44:26.2196416Z [rank3]:[W1204 13:38:25.647401776 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2196588Z [rank2]:[W1204 13:38:25.646621613 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2196762Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2197018Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2197181Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2197593Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2197798Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2197902Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2197998Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2198094Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2198097Z 2025-12-04T13:44:26.2198330Z [rank2]:[W1204 13:38:25.648986651 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2198502Z [rank1]:[W1204 13:38:26.256791191 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2198676Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2198931Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2199092Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2199474Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2199701Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2199807Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2199915Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2200012Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2200013Z 2025-12-04T13:44:26.2200249Z [rank1]:[W1204 13:38:26.258303058 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2200418Z [rank3]:[W1204 13:38:26.647575390 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2200594Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2200847Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2201011Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2201377Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2201580Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2201685Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2201779Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2201876Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2201878Z 2025-12-04T13:44:26.2202111Z [rank3]:[W1204 13:38:26.649843941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2202287Z [rank2]:[W1204 13:38:26.649097817 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2202464Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2202724Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2202887Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2203252Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2203477Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2203593Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2203688Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2203785Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2203799Z 2025-12-04T13:44:26.2204032Z [rank2]:[W1204 13:38:26.651016925 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2204204Z [rank1]:[W1204 13:38:27.258418024 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2204381Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2204638Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2204802Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2205171Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2205375Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2205481Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2205578Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2205673Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2205675Z 2025-12-04T13:44:26.2205909Z [rank1]:[W1204 13:38:27.259761794 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2206078Z [rank3]:[W1204 13:38:27.649984416 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2206253Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2206509Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2206675Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2207051Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2207262Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2207384Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2207534Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2207631Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2207633Z 2025-12-04T13:44:26.2207878Z [rank3]:[W1204 13:38:27.651848095 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2208048Z [rank2]:[W1204 13:38:27.651118941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2208223Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2208479Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2208644Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2209009Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2209212Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2209315Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2209413Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2209510Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2209511Z 2025-12-04T13:44:26.2209746Z [rank2]:[W1204 13:38:27.653253134 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2209917Z [rank1]:[W1204 13:38:28.260177943 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2210091Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2210345Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2210507Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2210873Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2211089Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2211195Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2211303Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2211412Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2211414Z 2025-12-04T13:44:26.2211647Z [rank1]:[W1204 13:38:28.261799208 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2211828Z [rank3]:[W1204 13:38:28.651982560 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2212005Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2212262Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2212427Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2212797Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2212998Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2213104Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2213199Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2213297Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2213299Z 2025-12-04T13:44:26.2213531Z [rank3]:[W1204 13:38:28.654102624 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2213703Z [rank2]:[W1204 13:38:28.653354700 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2213876Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2214137Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2214300Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2214666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2214868Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2214984Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2215080Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2215185Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2215198Z 2025-12-04T13:44:26.2215432Z [rank2]:[W1204 13:38:28.655510633 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2215612Z [rank1]:[W1204 13:38:29.261931383 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2215787Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2216044Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2216205Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2216570Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2216773Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2216878Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2216972Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2217067Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2217069Z 2025-12-04T13:44:26.2217303Z [rank1]:[W1204 13:38:29.263744103 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2217508Z [rank3]:[W1204 13:38:29.654270858 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2217684Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2217937Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2218102Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2218471Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2218672Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2218776Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2218886Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2218982Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2218984Z 2025-12-04T13:44:26.2219229Z [rank3]:[W1204 13:38:29.655871843 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2219415Z [rank2]:[W1204 13:38:29.655888873 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2219603Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2219859Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2220023Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2220394Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2220599Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2220702Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2220798Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2220895Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2220897Z 2025-12-04T13:44:26.2221130Z [rank2]:[W1204 13:38:29.658232392 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2221301Z [rank1]:[W1204 13:38:30.263901688 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2221474Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2221728Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2221890Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2222258Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2222460Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2222566Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2222661Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2222773Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2222775Z 2025-12-04T13:44:26.2223018Z [rank1]:[W1204 13:38:30.265204010 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2223201Z [rank3]:[W1204 13:38:30.656031539 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2223375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2223641Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2223805Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2224170Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2224373Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2224478Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2224572Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2224669Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2224673Z 2025-12-04T13:44:26.2224906Z [rank3]:[W1204 13:38:30.657662403 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2225078Z [rank2]:[W1204 13:38:30.658343098 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2225253Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2225507Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2225673Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2226041Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2226243Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2226348Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2226446Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2226542Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2226557Z 2025-12-04T13:44:26.2226793Z [rank2]:[W1204 13:38:30.660380633 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2226977Z [rank1]:[W1204 13:38:31.265322446 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2227162Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2227427Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2227626Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2227993Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2228195Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2228299Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2228394Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2228489Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2228491Z 2025-12-04T13:44:26.2228726Z [rank1]:[W1204 13:38:31.266596058 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2228896Z [rank3]:[W1204 13:38:31.657830168 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2229074Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2229330Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2229493Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2229858Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2230061Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2230166Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2230260Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2230356Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2230358Z 2025-12-04T13:44:26.2230590Z [rank3]:[W1204 13:38:31.660147537 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2230775Z [rank2]:[W1204 13:38:31.660507249 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2230963Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2231232Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2231410Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2231778Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2231984Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2232088Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2232185Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2232280Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2232284Z 2025-12-04T13:44:26.2232515Z [rank2]:[W1204 13:38:31.662769879 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2232686Z [rank1]:[W1204 13:38:32.266709924 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2232861Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2233115Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2233276Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2233647Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2233849Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2233954Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2234050Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2234144Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2234146Z 2025-12-04T13:44:26.2234380Z [rank1]:[W1204 13:38:32.267881678 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2234570Z [rank3]:[W1204 13:38:32.660309992 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2234745Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2235009Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2235187Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2235573Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2235776Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2235882Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2235978Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2236074Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2236076Z 2025-12-04T13:44:26.2236308Z [rank3]:[W1204 13:38:32.661553965 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2236480Z [rank2]:[W1204 13:38:32.662845576 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2236654Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2236908Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2237071Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2237437Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2237671Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2237774Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2237873Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2237968Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2237972Z 2025-12-04T13:44:26.2238203Z [rank2]:[W1204 13:38:32.663998641 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2238373Z [rank1]:[W1204 13:38:33.268022284 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2238562Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2238835Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2239012Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2239392Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2239597Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2239701Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2239797Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2239892Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2239894Z 2025-12-04T13:44:26.2240129Z [rank1]:[W1204 13:38:33.270301114 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2240298Z [rank3]:[W1204 13:38:33.661744219 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2240474Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2240729Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2240894Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2241260Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2241459Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2241567Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2241662Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2241759Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2241761Z 2025-12-04T13:44:26.2241995Z [rank3]:[W1204 13:38:33.663116939 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2242166Z [rank2]:[W1204 13:38:33.664097948 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2242341Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2242607Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2242782Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2243171Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2243373Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2243478Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2243574Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2243669Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2243674Z 2025-12-04T13:44:26.2243907Z [rank2]:[W1204 13:38:33.665307781 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2244079Z [rank1]:[W1204 13:38:34.270477789 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2244252Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2244507Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2244669Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2245035Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2245238Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2245340Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2245437Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2245531Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2245533Z 2025-12-04T13:44:26.2245768Z [rank1]:[W1204 13:38:34.272364898 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2245937Z [rank3]:[W1204 13:38:34.663265835 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2246113Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2246367Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2246544Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2246918Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2247146Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2247251Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2247346Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2247441Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2247443Z 2025-12-04T13:44:26.2247715Z [rank3]:[W1204 13:38:34.664476129 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2247887Z [rank2]:[W1204 13:38:34.665412508 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2248062Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2248319Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2248483Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2248850Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2249052Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2249157Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2249252Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2249349Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2249351Z 2025-12-04T13:44:26.2249584Z [rank2]:[W1204 13:38:34.667974262 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2249753Z [rank1]:[W1204 13:38:35.272545863 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2249926Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2250179Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2250359Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2250742Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2250967Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2251083Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2251180Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2251276Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2251278Z 2025-12-04T13:44:26.2251511Z [rank1]:[W1204 13:38:35.274776644 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2251682Z [rank3]:[W1204 13:38:35.664617105 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2251857Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2252111Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2252274Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2252644Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2252846Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2252950Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2253045Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2253141Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2253144Z 2025-12-04T13:44:26.2253375Z [rank3]:[W1204 13:38:35.666328707 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2253546Z [rank2]:[W1204 13:38:35.668085389 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2253721Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2253975Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2254139Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2254530Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2254744Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2254849Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2254956Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2255052Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2255056Z 2025-12-04T13:44:26.2255289Z [rank2]:[W1204 13:38:35.670312460 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2255460Z [rank1]:[W1204 13:38:36.274888451 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2255636Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2255892Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2256053Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2256426Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2256631Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2256734Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2256829Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2256924Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2256926Z 2025-12-04T13:44:26.2257161Z [rank1]:[W1204 13:38:36.276856828 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2257331Z [rank3]:[W1204 13:38:36.666472873 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2257565Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2257823Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2257987Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2258357Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2258588Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2258706Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2258801Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2258910Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2258912Z 2025-12-04T13:44:26.2259145Z [rank3]:[W1204 13:38:36.668487459 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2259315Z [rank2]:[W1204 13:38:36.670414087 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2259491Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2259747Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2259912Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2260280Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2260484Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2260588Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2260684Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2260782Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2260784Z 2025-12-04T13:44:26.2261017Z [rank2]:[W1204 13:38:36.671558732 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2261186Z [rank1]:[W1204 13:38:37.276982864 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2261360Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2261614Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2261775Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2262141Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2262355Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2262481Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2262579Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2262673Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2262675Z 2025-12-04T13:44:26.2262924Z [rank1]:[W1204 13:38:37.279076699 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2263097Z [rank3]:[W1204 13:38:37.668614366 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2263272Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2263530Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2263691Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2264057Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2264258Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2264364Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2264459Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2264555Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2264557Z 2025-12-04T13:44:26.2264790Z [rank3]:[W1204 13:38:37.670714720 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2264961Z [rank2]:[W1204 13:38:37.671661139 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2265138Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2265394Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2265557Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2265923Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2266136Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2266241Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2266353Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2266461Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2266463Z 2025-12-04T13:44:26.2266707Z [rank2]:[W1204 13:38:37.673532218 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2266877Z [rank1]:[W1204 13:38:38.279231765 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2267051Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2267309Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2267511Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2267878Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2268079Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2268185Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2268280Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2268376Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2268378Z 2025-12-04T13:44:26.2268613Z [rank1]:[W1204 13:38:38.281167462 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2268782Z [rank3]:[W1204 13:38:38.670869476 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2268958Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2269217Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2269381Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2269753Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2269953Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2270075Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2270169Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2270279Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2270295Z 2025-12-04T13:44:26.2270530Z [rank3]:[W1204 13:38:38.672104899 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2270711Z [rank2]:[W1204 13:38:38.673629515 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2270887Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2271142Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2271306Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2271674Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2271878Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2271985Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2272080Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2272176Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2272180Z 2025-12-04T13:44:26.2272412Z [rank2]:[W1204 13:38:38.674785640 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2272582Z [rank1]:[W1204 13:38:39.281289879 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2272755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2273011Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2273174Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2273540Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2273743Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2273847Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2273960Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2274055Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2274057Z 2025-12-04T13:44:26.2274300Z [rank1]:[W1204 13:38:39.283386853 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2274479Z [rank3]:[W1204 13:38:39.672271125 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2274663Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2274920Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2275083Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2275452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2275656Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2275760Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2275856Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2275953Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2275955Z 2025-12-04T13:44:26.2276190Z [rank3]:[W1204 13:38:39.673507457 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2276360Z [rank2]:[W1204 13:38:39.674878387 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2276536Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2276791Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2276954Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2277320Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2277566Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2277671Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2277766Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2277880Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2277882Z 2025-12-04T13:44:26.2278127Z [rank2]:[W1204 13:38:39.676016912 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2278310Z [rank1]:[W1204 13:38:40.283547969 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2278484Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2278753Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2278916Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2279284Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2279486Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2279590Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2279685Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2279781Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2279783Z 2025-12-04T13:44:26.2280022Z [rank1]:[W1204 13:38:40.285695162 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2280192Z [rank3]:[W1204 13:38:40.673806251 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2280366Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2280622Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2280783Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2281150Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2281351Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2281455Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2281551Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2281647Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2281668Z 2025-12-04T13:44:26.2281902Z [rank3]:[W1204 13:38:40.676210368 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2282083Z [rank2]:[W1204 13:38:40.676125800 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2282270Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2282536Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2282700Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2283068Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2283271Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2283376Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2283471Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2283568Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2283571Z 2025-12-04T13:44:26.2283802Z [rank2]:[W1204 13:38:40.677279235 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2283972Z [rank1]:[W1204 13:38:41.285807549 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2284148Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2284405Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2284566Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2284933Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2285136Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2285238Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2285334Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2285429Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2285431Z 2025-12-04T13:44:26.2285664Z [rank1]:[W1204 13:38:41.287160130 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2285843Z [rank3]:[W1204 13:38:41.676402524 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2286040Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2286296Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2286469Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2286840Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2287043Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2287150Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2287244Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2287340Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2287342Z 2025-12-04T13:44:26.2287618Z [rank3]:[W1204 13:38:41.678565816 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2287788Z [rank2]:[W1204 13:38:41.677409332 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2287964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2288219Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2288383Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2288754Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2288957Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2289063Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2289157Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2289253Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2289255Z 2025-12-04T13:44:26.2289488Z [rank2]:[W1204 13:38:41.679491006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2289674Z [rank1]:[W1204 13:38:42.287620040 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2289847Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2290129Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2290291Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2290668Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2290873Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2290977Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2291075Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2291169Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2291171Z 2025-12-04T13:44:26.2291408Z [rank1]:[W1204 13:38:42.290237152 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2291580Z [rank3]:[W1204 13:38:42.678721013 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2291754Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2292010Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2292173Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2292543Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2292745Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2292850Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2292946Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2293041Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2293043Z 2025-12-04T13:44:26.2293278Z [rank3]:[W1204 13:38:42.680150421 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2293447Z [rank2]:[W1204 13:38:42.679590644 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2293633Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2293896Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2294072Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2294452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2294657Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2294762Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2294858Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2294955Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2294957Z 2025-12-04T13:44:26.2295192Z [rank2]:[W1204 13:38:42.681746506 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2295361Z [rank1]:[W1204 13:38:43.290391199 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2295536Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2295791Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2295955Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2296321Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2296522Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2296628Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2296724Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2296821Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2296822Z 2025-12-04T13:44:26.2297058Z [rank1]:[W1204 13:38:43.291988294 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2297228Z [rank3]:[W1204 13:38:43.680327008 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2297401Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2297709Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2297903Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2298284Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2298485Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2298591Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2298685Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2298783Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2298786Z 2025-12-04T13:44:26.2299021Z [rank3]:[W1204 13:38:43.682473141 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2299191Z [rank2]:[W1204 13:38:43.681857424 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2299366Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2299620Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2299785Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2300151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2300354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2300459Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2300553Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2300649Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2300654Z 2025-12-04T13:44:26.2300886Z [rank2]:[W1204 13:38:43.683571026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2301056Z [rank1]:[W1204 13:38:44.292153370 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2301231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2301500Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2301663Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2302050Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2302264Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2302368Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2302464Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2302559Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2302560Z 2025-12-04T13:44:26.2302795Z [rank1]:[W1204 13:38:44.293894062 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2302965Z [rank3]:[W1204 13:38:44.682628637 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2303141Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2303398Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2303563Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2303928Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2304129Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2304233Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2304329Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2304423Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2304425Z 2025-12-04T13:44:26.2304659Z [rank3]:[W1204 13:38:44.684963536 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2304828Z [rank2]:[W1204 13:38:44.683685434 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2305004Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2305258Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2305434Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2305811Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2306023Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2306138Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2306233Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2306332Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2306334Z 2025-12-04T13:44:26.2306566Z [rank2]:[W1204 13:38:44.685855367 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2306737Z [rank1]:[W1204 13:38:45.294060429 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2306910Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2307164Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2307328Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2307769Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2307973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2308077Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2308173Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2308269Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2308270Z 2025-12-04T13:44:26.2308507Z [rank1]:[W1204 13:38:45.296094884 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2308679Z [rank3]:[W1204 13:38:45.685103203 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2308852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2309111Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2309272Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2309675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2309895Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2310000Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2310110Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2310205Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2310208Z 2025-12-04T13:44:26.2310443Z [rank3]:[W1204 13:38:45.686842875 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2310612Z [rank2]:[W1204 13:38:45.685913226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2310788Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2311043Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2311207Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2311577Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2311778Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2311882Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2311979Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2312076Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2312078Z 2025-12-04T13:44:26.2312314Z [rank2]:[W1204 13:38:45.687032841 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2312485Z [rank1]:[W1204 13:38:46.296240131 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2312659Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2312914Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2313078Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2313453Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2313664Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2313778Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2313873Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2313980Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2313982Z 2025-12-04T13:44:26.2314216Z [rank1]:[W1204 13:38:46.298784006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2314390Z [rank2]:[W1204 13:38:46.687100650 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2314565Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2314821Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2314984Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2315350Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2315552Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2315657Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2315753Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2315851Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2315853Z 2025-12-04T13:44:26.2316086Z [rank2]:[W1204 13:38:46.688315483 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2316255Z [rank3]:[W1204 13:38:46.687028602 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2316432Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2316689Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2316852Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2317220Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2317445Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2317625Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2317719Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2317816Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2317818Z 2025-12-04T13:44:26.2318064Z [rank3]:[W1204 13:38:46.688351163 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2318235Z [rank1]:[W1204 13:38:47.298920553 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2318409Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2318669Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2318834Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2319199Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2319401Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2319507Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2319603Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2319697Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2319699Z 2025-12-04T13:44:26.2319933Z [rank1]:[W1204 13:38:47.301520676 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2320103Z [rank3]:[W1204 13:38:47.688466371 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2320277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2320533Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2320695Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2321068Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2321281Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2321387Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2321509Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2321604Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2321606Z 2025-12-04T13:44:26.2321848Z [rank3]:[W1204 13:38:47.689724263 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2322018Z [rank2]:[W1204 13:38:47.688466371 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2322195Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2322449Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2322613Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2322980Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2323184Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2323289Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2323385Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2323482Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2323484Z 2025-12-04T13:44:26.2323719Z [rank2]:[W1204 13:38:47.689857110 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2323890Z [rank1]:[W1204 13:38:48.301660114 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2324064Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2324318Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2324481Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2324847Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2325048Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2325163Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2325259Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2325382Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2325385Z 2025-12-04T13:44:26.2325620Z [rank1]:[W1204 13:38:48.304007992 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2325803Z [rank2]:[W1204 13:38:48.689951159 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2325977Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2326232Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2326394Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2326763Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2326964Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2327068Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2327164Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2327260Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2327263Z 2025-12-04T13:44:26.2327435Z [rank3]:[W1204 13:38:48.689878730 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2327646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2327902Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2328065Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2328434Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > 
> > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2328639Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2328746Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2328842Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2328950Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2328952Z 2025-12-04T13:44:26.2329186Z [rank2]:[W1204 13:38:48.691129803 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2329443Z [rank3]:[W1204 13:38:48.691268720 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2329617Z [rank1]:[W1204 13:38:49.304132990 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2329806Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2330060Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2330224Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2330590Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 
0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2330792Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2330896Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2330992Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2331088Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2331091Z 2025-12-04T13:44:26.2331324Z [rank1]:[W1204 13:38:49.306243084 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2331494Z [rank2]:[W1204 13:38:49.691365519 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2331670Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2331926Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2332089Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2332455Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2332659Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2332763Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2332870Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2332965Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2332967Z 2025-12-04T13:44:26.2333148Z [rank3]:[W1204 13:38:49.691398418 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2333332Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2333597Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2333760Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2334132Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2334334Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2334440Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2334539Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2334636Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2334637Z 2025-12-04T13:44:26.2334875Z [rank2]:[W1204 13:38:49.692634991 
ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2335106Z [rank3]:[W1204 13:38:49.692637481 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2335279Z [rank1]:[W1204 13:38:50.306407401 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2335453Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2335709Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2335876Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2336242Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2336446Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2336550Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2336646Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2336751Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2336755Z 2025-12-04T13:44:26.2336988Z [rank1]:[W1204 13:38:50.308762820 
ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2337184Z [rank2]:[W1204 13:38:50.692767639 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2337359Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2337664Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2337829Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2338203Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2338408Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2338513Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2338610Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2338706Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2338708Z 2025-12-04T13:44:26.2338942Z [rank2]:[W1204 13:38:50.694903132 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2339111Z [rank3]:[W1204 13:38:50.692783708 
TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2339288Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2339544Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2339710Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2340082Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2340285Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2340391Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2340487Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2340583Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2340585Z 2025-12-04T13:44:26.2340839Z [rank3]:[W1204 13:38:50.694952291 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2341024Z [rank1]:[W1204 13:38:51.308934277 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2341213Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2341468Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2341641Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2342008Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2342211Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2342315Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2342415Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2342512Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2342514Z 2025-12-04T13:44:26.2342747Z [rank1]:[W1204 13:38:51.310852325 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2342919Z [rank2]:[W1204 13:38:51.695021440 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2343094Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2343351Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2343514Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2343887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2344091Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2344195Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2344292Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2344388Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2344390Z 2025-12-04T13:44:26.2344624Z [rank2]:[W1204 13:38:51.696712083 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2344804Z [rank3]:[W1204 13:38:51.695068529 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2344991Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2345256Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2345430Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2345797Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2345997Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2346107Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2346202Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2346297Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2346299Z 2025-12-04T13:44:26.2346531Z [rank3]:[W1204 13:38:51.697193173 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2346702Z [rank1]:[W1204 13:38:52.311029632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2346876Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2347132Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2347294Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2347701Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2347902Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2348008Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2348104Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2348201Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2348202Z 2025-12-04T13:44:26.2348437Z [rank1]:[W1204 13:38:52.313410320 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2348607Z [rank2]:[W1204 13:38:52.696835952 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2348796Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2349066Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2349242Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2349622Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2349825Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2349929Z frame #4: + 0xdc253 (0x78d42d848253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2350026Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2350122Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2350124Z 2025-12-04T13:44:26.2350357Z [rank2]:[W1204 13:38:52.699106962 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2350530Z [rank3]:[W1204 13:38:52.697371700 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2350705Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2350961Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2351124Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2351493Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2351694Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2351799Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2351894Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2351992Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.2351994Z 2025-12-04T13:44:26.2352227Z [rank3]:[W1204 13:38:52.699654260 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2352398Z [rank1]:[W1204 13:38:53.313533838 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2352585Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2352849Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2353025Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2353405Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2353607Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2353712Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2353808Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2353906Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2353908Z 2025-12-04T13:44:26.2354141Z [rank1]:[W1204 13:38:53.315721900 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.2354313Z [rank2]:[W1204 13:38:53.699257900 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2354486Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2354742Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2354906Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2355276Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2355479Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2355583Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2355683Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2355780Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2355783Z 2025-12-04T13:44:26.2356016Z [rank2]:[W1204 13:38:53.700471843 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2356186Z [rank3]:[W1204 13:38:53.699792168 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2356361Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2356631Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2356805Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2357183Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2357393Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2357546Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2357642Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2357738Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2357739Z 2025-12-04T13:44:26.2357973Z [rank3]:[W1204 13:38:53.700953783 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2358144Z [rank1]:[W1204 13:38:54.315847679 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2358319Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2358574Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2358739Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2359107Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2359309Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2359414Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2359509Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2359606Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2359607Z 2025-12-04T13:44:26.2359841Z [rank1]:[W1204 13:38:54.317530342 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2360011Z [rank3]:[W1204 13:38:54.701065042 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2360187Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2360442Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2360622Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.2361009Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2361241Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2361346Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2361445Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2361541Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2361542Z 2025-12-04T13:44:26.2361777Z [rank3]:[W1204 13:38:54.702779714 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2361946Z [rank2]:[W1204 13:38:54.700673950 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2362121Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2362377Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2362540Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2362910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.2363114Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2363221Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2363316Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2363415Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2363417Z 2025-12-04T13:44:26.2363650Z [rank2]:[W1204 13:38:54.702828763 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2363820Z [rank1]:[W1204 13:38:55.317671440 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2363995Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2364249Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2364422Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2364798Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2365012Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2365127Z frame #4: + 0xdc253 
(0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2365223Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2365320Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2365324Z 2025-12-04T13:44:26.2365554Z [rank1]:[W1204 13:38:55.319572789 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2365728Z [rank2]:[W1204 13:38:55.702951252 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2365903Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2366158Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2366320Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2366687Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2366890Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2366993Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2367091Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2367187Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2367189Z 2025-12-04T13:44:26.2367362Z [rank3]:[W1204 13:38:55.702951262 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2367573Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2367829Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2367991Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2368358Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2368574Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2368692Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2368810Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2368905Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2368906Z 2025-12-04T13:44:26.2369155Z [rank2]:[W1204 13:38:55.704160185 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2369386Z [rank3]:[W1204 13:38:55.704160565 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2370163Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Exception No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. for benchmark choice TritonTemplateCaller(/tmp/torchinductor_jenkins/lu/cluh5g7rljnlwawh7tlemcc6bhzbbvmw7vnczfpfcwecjseznjbo.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8) 2025-12-04T13:44:26.2370322Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Traceback (most recent call last): 2025-12-04T13:44:26.2370549Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2025-12-04T13:44:26.2370716Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] result = self.fn(*self.args, **self.kwargs) 2025-12-04T13:44:26.2371013Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 3255, in precompile_with_captured_stdout 2025-12-04T13:44:26.2371155Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] choice.precompile() 2025-12-04T13:44:26.2371423Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 2289, in precompile 2025-12-04T13:44:26.2371571Z [rank2]:E1204 13:38:56.036000 96067 
site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self.bmreq.precompile() 2025-12-04T13:44:26.2371834Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 677, in precompile 2025-12-04T13:44:26.2371999Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] getattr(mod, self.kernel_name).precompile() 2025-12-04T13:44:26.2372278Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 444, in precompile 2025-12-04T13:44:26.2372420Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self._make_launchers() 2025-12-04T13:44:26.2372713Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 613, in _make_launchers 2025-12-04T13:44:26.2372924Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") 2025-12-04T13:44:26.2373256Z [rank2]:E1204 13:38:56.036000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 2025-12-04T13:44:26.2374012Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Exception No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 
for benchmark choice TritonTemplateCaller(/tmp/torchinductor_jenkins/uz/cuznbcky4xntnpyuivp7wmxi77l7hx355rfo4iwmbqgc46q2z2bt.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8) 2025-12-04T13:44:26.2374168Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Traceback (most recent call last): 2025-12-04T13:44:26.2374391Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2025-12-04T13:44:26.2374559Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] result = self.fn(*self.args, **self.kwargs) 2025-12-04T13:44:26.2374850Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 3255, in precompile_with_captured_stdout 2025-12-04T13:44:26.2374990Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] choice.precompile() 2025-12-04T13:44:26.2375254Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 2289, in precompile 2025-12-04T13:44:26.2375399Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self.bmreq.precompile() 2025-12-04T13:44:26.2375661Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 677, in precompile 2025-12-04T13:44:26.2375824Z 
[rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] getattr(mod, self.kernel_name).precompile() 2025-12-04T13:44:26.2376099Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 444, in precompile 2025-12-04T13:44:26.2376241Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self._make_launchers() 2025-12-04T13:44:26.2376524Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 613, in _make_launchers 2025-12-04T13:44:26.2376726Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") 2025-12-04T13:44:26.2377056Z [rank1]:E1204 13:38:56.118000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 2025-12-04T13:44:26.2377872Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Exception No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 
for benchmark choice TritonTemplateCaller(/tmp/torchinductor_jenkins/lv/clvxgoofrug4jv7h5pystanfcn24y5ulsrsmqjbfxpnnd3mqzgz3.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8) 2025-12-04T13:44:26.2378037Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Traceback (most recent call last): 2025-12-04T13:44:26.2378258Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2025-12-04T13:44:26.2378420Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] result = self.fn(*self.args, **self.kwargs) 2025-12-04T13:44:26.2378710Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 3255, in precompile_with_captured_stdout 2025-12-04T13:44:26.2378851Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] choice.precompile() 2025-12-04T13:44:26.2379114Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 2289, in precompile 2025-12-04T13:44:26.2379259Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self.bmreq.precompile() 2025-12-04T13:44:26.2379519Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 677, in precompile 2025-12-04T13:44:26.2379683Z 
[rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] getattr(mod, self.kernel_name).precompile() 2025-12-04T13:44:26.2379957Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 444, in precompile 2025-12-04T13:44:26.2380097Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self._make_launchers() 2025-12-04T13:44:26.2380379Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 613, in _make_launchers 2025-12-04T13:44:26.2380580Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") 2025-12-04T13:44:26.2380900Z [rank1]:E1204 13:38:56.119000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 2025-12-04T13:44:26.2381668Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Exception No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 
for benchmark choice TritonTemplateCaller(/tmp/torchinductor_jenkins/2u/c2u5uz52kxlskfzhryur3m57pk3dhkqfjyoh4khlqqcld4hyqxih.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8) 2025-12-04T13:44:26.2386202Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Traceback (most recent call last): 2025-12-04T13:44:26.2386460Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2025-12-04T13:44:26.2386625Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] result = self.fn(*self.args, **self.kwargs) 2025-12-04T13:44:26.2386920Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 3255, in precompile_with_captured_stdout 2025-12-04T13:44:26.2387063Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] choice.precompile() 2025-12-04T13:44:26.2387328Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 2289, in precompile 2025-12-04T13:44:26.2387522Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self.bmreq.precompile() 2025-12-04T13:44:26.2387784Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 677, in precompile 2025-12-04T13:44:26.2387949Z 
[rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] getattr(mod, self.kernel_name).precompile() 2025-12-04T13:44:26.2388223Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 444, in precompile 2025-12-04T13:44:26.2388366Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self._make_launchers() 2025-12-04T13:44:26.2388646Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 613, in _make_launchers 2025-12-04T13:44:26.2388850Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") 2025-12-04T13:44:26.2389172Z [rank1]:E1204 13:38:56.120000 96066 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 2025-12-04T13:44:26.2389658Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T13:44:26.2389709Z current_size = base.storage().size() 2025-12-04T13:44:26.2390496Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Exception No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. for benchmark choice TritonTemplateCaller(/tmp/torchinductor_jenkins/nw/cnwcmgwusr53jcmwp2vxpk5ouzhr4waunjc4qgmkxkfvfjp56cgu.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8) 2025-12-04T13:44:26.2390683Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Traceback (most recent call last): 2025-12-04T13:44:26.2390921Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2025-12-04T13:44:26.2391083Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] result = self.fn(*self.args, **self.kwargs) 2025-12-04T13:44:26.2391379Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 3255, in precompile_with_captured_stdout 2025-12-04T13:44:26.2391520Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] choice.precompile() 2025-12-04T13:44:26.2391784Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 2289, in precompile 2025-12-04T13:44:26.2391931Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self.bmreq.precompile() 2025-12-04T13:44:26.2392189Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 677, in precompile 2025-12-04T13:44:26.2392354Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] getattr(mod, self.kernel_name).precompile() 2025-12-04T13:44:26.2392629Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 444, in precompile 2025-12-04T13:44:26.2392771Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self._make_launchers() 2025-12-04T13:44:26.2393051Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 613, in _make_launchers 2025-12-04T13:44:26.2393252Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") 2025-12-04T13:44:26.2393571Z [rank2]:E1204 13:38:56.140000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 2025-12-04T13:44:26.2394318Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Exception No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 
for benchmark choice TritonTemplateCaller(/tmp/torchinductor_jenkins/4x/c4xvxjcm6iftdqtm6hh5ara7ctcj3yqjm3fvki4ouq7pkmug2efo.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8) 2025-12-04T13:44:26.2394480Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Traceback (most recent call last): 2025-12-04T13:44:26.2394711Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2025-12-04T13:44:26.2394886Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] result = self.fn(*self.args, **self.kwargs) 2025-12-04T13:44:26.2395190Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 3255, in precompile_with_captured_stdout 2025-12-04T13:44:26.2395330Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] choice.precompile() 2025-12-04T13:44:26.2395593Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 2289, in precompile 2025-12-04T13:44:26.2395738Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self.bmreq.precompile() 2025-12-04T13:44:26.2395998Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 677, in precompile 2025-12-04T13:44:26.2396162Z 
[rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] getattr(mod, self.kernel_name).precompile() 2025-12-04T13:44:26.2396434Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 444, in precompile 2025-12-04T13:44:26.2396576Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self._make_launchers() 2025-12-04T13:44:26.2396856Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 613, in _make_launchers 2025-12-04T13:44:26.2397060Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") 2025-12-04T13:44:26.2397380Z [rank2]:E1204 13:38:56.147000 96067 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 2025-12-04T13:44:26.2397904Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T13:44:26.2397954Z current_size = base.storage().size() 2025-12-04T13:44:26.2398696Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Exception No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. for benchmark choice TritonTemplateCaller(/tmp/torchinductor_jenkins/7r/c7r3ssxpd3a7yqrbtirlrg775lnsxenjoewuzrq3sx2fhfhm3sbw.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8) 2025-12-04T13:44:26.2398863Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Traceback (most recent call last): 2025-12-04T13:44:26.2399101Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2025-12-04T13:44:26.2399275Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] result = self.fn(*self.args, **self.kwargs) 2025-12-04T13:44:26.2399585Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 3255, in precompile_with_captured_stdout 2025-12-04T13:44:26.2399723Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] choice.precompile() 2025-12-04T13:44:26.2399985Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 2289, in precompile 2025-12-04T13:44:26.2400129Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self.bmreq.precompile() 2025-12-04T13:44:26.2400389Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 677, in precompile 2025-12-04T13:44:26.2400551Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] getattr(mod, self.kernel_name).precompile() 2025-12-04T13:44:26.2400825Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 444, in precompile 2025-12-04T13:44:26.2400967Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self._make_launchers() 2025-12-04T13:44:26.2401246Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 613, in _make_launchers 2025-12-04T13:44:26.2401446Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") 2025-12-04T13:44:26.2401767Z [rank0]:E1204 13:38:56.198000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 2025-12-04T13:44:26.2402514Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Exception No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 
for benchmark choice TritonTemplateCaller(/tmp/torchinductor_jenkins/zt/cztuugpvwhycwmpb7wdrwmwxwayymfw5oeo53757nueusp3grxq6.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8) 2025-12-04T13:44:26.2402667Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Traceback (most recent call last): 2025-12-04T13:44:26.2402886Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2025-12-04T13:44:26.2403058Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] result = self.fn(*self.args, **self.kwargs) 2025-12-04T13:44:26.2403357Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 3255, in precompile_with_captured_stdout 2025-12-04T13:44:26.2403508Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] choice.precompile() 2025-12-04T13:44:26.2403780Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 2289, in precompile 2025-12-04T13:44:26.2403924Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self.bmreq.precompile() 2025-12-04T13:44:26.2404190Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 677, in precompile 2025-12-04T13:44:26.2404353Z 
[rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] getattr(mod, self.kernel_name).precompile() 2025-12-04T13:44:26.2404626Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 444, in precompile 2025-12-04T13:44:26.2404766Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self._make_launchers() 2025-12-04T13:44:26.2405047Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 613, in _make_launchers 2025-12-04T13:44:26.2405247Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") 2025-12-04T13:44:26.2405564Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 2025-12-04T13:44:26.2406309Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Exception No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 
for benchmark choice TritonTemplateCaller(/tmp/torchinductor_jenkins/eo/ceoosp4hvvsooonjt2vddh66lxwlm7utk46ucxi7x4wxj2lmkbpy.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8) 2025-12-04T13:44:26.2406464Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] Traceback (most recent call last): 2025-12-04T13:44:26.2406684Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/concurrent/futures/thread.py", line 58, in run 2025-12-04T13:44:26.2406846Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] result = self.fn(*self.args, **self.kwargs) 2025-12-04T13:44:26.2407137Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 3255, in precompile_with_captured_stdout 2025-12-04T13:44:26.2407295Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] choice.precompile() 2025-12-04T13:44:26.2407606Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py", line 2289, in precompile 2025-12-04T13:44:26.2407762Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self.bmreq.precompile() 2025-12-04T13:44:26.2408032Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/autotune_process.py", line 677, in precompile 2025-12-04T13:44:26.2408195Z 
[rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] getattr(mod, self.kernel_name).precompile() 2025-12-04T13:44:26.2408470Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 444, in precompile 2025-12-04T13:44:26.2408611Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] self._make_launchers() 2025-12-04T13:44:26.2408893Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 613, in _make_launchers 2025-12-04T13:44:26.2409093Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}") 2025-12-04T13:44:26.2409411Z [rank0]:E1204 13:38:56.199000 96065 site-packages/torch/_inductor/select_algorithm.py:3323] [0/0_1] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help. 2025-12-04T13:44:26.2409892Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T13:44:26.2409940Z current_size = base.storage().size() 2025-12-04T13:44:26.2410115Z [rank1]:[W1204 13:38:56.319748946 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2410293Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2410558Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2410724Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2411101Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2411306Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2411417Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2411532Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2411630Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2411633Z 2025-12-04T13:44:26.2411883Z [rank1]:[W1204 13:38:56.322113525 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2412066Z [rank3]:[W1204 13:38:56.704318463 TCPStore.cpp:106] [c10d] sendBytes 
failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2412253Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2412511Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2412676Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2413043Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2413249Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2413358Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2413455Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2413554Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2413556Z 2025-12-04T13:44:26.2413793Z [rank3]:[W1204 13:38:56.705935998 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2413964Z [rank2]:[W1204 13:38:56.704318463 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2414140Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2414395Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2414559Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2414925Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2415128Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2415236Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2415333Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2415448Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2415450Z 2025-12-04T13:44:26.2415695Z [rank2]:[W1204 13:38:56.706351969 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2415753Z Autotune Choices Stats: 2025-12-04T13:44:26.2416232Z {"num_choices": 34, "num_triton_choices": 33, "best_kernel": "mm", "best_time": 0.01831899955868721, "best_triton_pos": 1, "best_triton_time": 0.018598999828100204, "best_triton_kernel": "triton_mm_22", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8"} 2025-12-04T13:44:26.2416283Z AUTOTUNE mm(1024x1024, 1024x2048) 
2025-12-04T13:44:26.2416328Z strides: [1024, 1], [2048, 1] 2025-12-04T13:44:26.2416381Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T13:44:26.2416421Z mm 0.0183 ms 100.0% 2025-12-04T13:44:26.2416663Z triton_mm_22 0.0186 ms 98.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2416899Z triton_mm_11 0.0188 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2417131Z triton_mm_17 0.0196 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2417362Z triton_mm_12 0.0217 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2417631Z triton_mm_16 0.0228 ms 80.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2417865Z triton_mm_4 0.0244 ms 75.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2418100Z triton_mm_19 0.0248 ms 73.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 
2025-12-04T13:44:26.2418330Z triton_mm_21 0.0255 ms 71.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2418565Z triton_mm_27 0.0256 ms 71.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2418698Z SingleProcess AUTOTUNE benchmarking takes 0.6731 seconds and 0.0033 seconds precompiling for 34 choices 2025-12-04T13:44:26.2418871Z [rank1]:[W1204 13:38:57.322285903 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2419046Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2419303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2419483Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2419866Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2420099Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2420206Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2420303Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2420399Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2420401Z 2025-12-04T13:44:26.2420639Z [rank1]:[W1204 13:38:57.323992505 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2420810Z [rank3]:[W1204 13:38:57.706120506 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2420987Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2421243Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2421406Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2421774Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2421977Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2422083Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2422178Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2422275Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2422277Z 2025-12-04T13:44:26.2422511Z [rank3]:[W1204 13:38:57.707421047 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2422681Z [rank2]:[W1204 13:38:57.706481308 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2422855Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2423110Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2423284Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2423660Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2423874Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2423990Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2424087Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2424186Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2424188Z 2025-12-04T13:44:26.2424420Z [rank2]:[W1204 13:38:57.708453564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2424592Z [rank1]:[W1204 13:38:58.324149743 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2424766Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2425020Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2425184Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2425549Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2425751Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2425855Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2425950Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2426045Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2426048Z 2025-12-04T13:44:26.2426282Z [rank1]:[W1204 13:38:58.325378437 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2426455Z [rank3]:[W1204 13:38:58.707596845 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2426629Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2426889Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2427050Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2427438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2427682Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2427787Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2427896Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2427993Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2427996Z 2025-12-04T13:44:26.2428229Z [rank3]:[W1204 13:38:58.708917096 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2428398Z [rank2]:[W1204 13:38:58.708581234 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2428574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2428834Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2428997Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2429362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2429566Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2429671Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2429766Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2429863Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2429865Z 2025-12-04T13:44:26.2430098Z [rank2]:[W1204 13:38:58.710239127 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2430269Z [rank1]:[W1204 13:38:59.325543585 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2430444Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2430700Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2430864Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2431230Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2431471Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2431598Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2431694Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2431804Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2431806Z 2025-12-04T13:44:26.2432040Z [rank1]:[W1204 13:38:59.328008061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2432212Z [rank3]:[W1204 13:38:59.709038766 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2432386Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2432642Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2432803Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2433172Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2433372Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2433477Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2433573Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2433668Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2433670Z 2025-12-04T13:44:26.2433904Z [rank3]:[W1204 13:38:59.710679029 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2434073Z [rank2]:[W1204 13:38:59.710335537 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2434248Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2434503Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2434666Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2435032Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2435249Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2435382Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2435477Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2435573Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2435575Z 2025-12-04T13:44:26.2435817Z [rank2]:[W1204 13:38:59.712300164 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2435988Z [rank1]:[W1204 13:39:00.328135380 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2436162Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2436417Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2436579Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2436944Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2437146Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2437250Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2437346Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2437441Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2437444Z 2025-12-04T13:44:26.2437718Z [rank1]:[W1204 13:39:00.330520338 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2437888Z [rank3]:[W1204 13:39:00.710784839 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2438061Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2438317Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2438479Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2438845Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2439058Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2439163Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2439272Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2439384Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2439386Z 2025-12-04T13:44:26.2439635Z [rank3]:[W1204 13:39:00.711957764 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2439805Z [rank2]:[W1204 13:39:00.712399664 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2439981Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2440235Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2440398Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2440764Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2440963Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2441068Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2441163Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2441262Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2441264Z 2025-12-04T13:44:26.2441495Z [rank2]:[W1204 13:39:00.714521897 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2441669Z [rank1]:[W1204 13:39:01.330653268 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2441843Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2442098Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2442262Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2442628Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2442829Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2442943Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2443038Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2443144Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2443158Z 2025-12-04T13:44:26.2443390Z [rank1]:[W1204 13:39:01.333060905 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2443571Z [rank3]:[W1204 13:39:01.712086064 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2443745Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2444009Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2444173Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2444539Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2444741Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2444847Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2444942Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2445036Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2445040Z 2025-12-04T13:44:26.2445273Z [rank3]:[W1204 13:39:01.713320977 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2445441Z [rank2]:[W1204 13:39:01.714643628 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2445615Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2445869Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2446033Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2446402Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2446602Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2446705Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2446811Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2446907Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2446909Z 2025-12-04T13:44:26.2447153Z [rank2]:[W1204 13:39:01.716709032 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2447333Z [rank1]:[W1204 13:39:02.333192124 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2447573Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2447830Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2447994Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2448362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2448565Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2448668Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2448764Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2448859Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2448861Z 2025-12-04T13:44:26.2449094Z [rank1]:[W1204 13:39:02.335116092 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2449264Z [rank3]:[W1204 13:39:02.713519745 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2449438Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2449693Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2449855Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2450224Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2450425Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2450532Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2450627Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2450746Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2450748Z 2025-12-04T13:44:26.2451000Z [rank3]:[W1204 13:39:02.715332405 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2451183Z [rank2]:[W1204 13:39:02.716809472 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2451358Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2451622Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2451787Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2452154Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2452356Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2452461Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2452556Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2452654Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2452657Z 2025-12-04T13:44:26.2452890Z [rank2]:[W1204 13:39:02.718887887 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2453062Z [rank1]:[W1204 13:39:03.335258801 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2453235Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2453489Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2453655Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2454021Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2454223Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2454326Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2454423Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2454518Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2454533Z 2025-12-04T13:44:26.2454768Z [rank1]:[W1204 13:39:03.337759867 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2454950Z [rank3]:[W1204 13:39:03.715960904 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2455134Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2455403Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2455566Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2455934Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2456135Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2456239Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2456335Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2456431Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2456433Z 2025-12-04T13:44:26.2456665Z [rank3]:[W1204 13:39:03.718197615 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2456835Z [rank2]:[W1204 13:39:03.718981857 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2457012Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2457269Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2457432Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2457845Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2458047Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2458152Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2458247Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2458345Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2458347Z 2025-12-04T13:44:26.2458579Z [rank2]:[W1204 13:39:03.722415942 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2458768Z [rank1]:[W1204 13:39:04.337893886 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2458957Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2459227Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2459402Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2459768Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2459973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2460078Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2460173Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2460269Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2460272Z 2025-12-04T13:44:26.2460503Z [rank1]:[W1204 13:39:04.340381462 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2460673Z [rank3]:[W1204 13:39:04.718351804 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2460847Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2461103Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2461265Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2461633Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2461836Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2461941Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2462036Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2462131Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2462133Z 2025-12-04T13:44:26.2462366Z [rank3]:[W1204 13:39:04.719989288 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2462548Z [rank2]:[W1204 13:39:04.722528882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2462723Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2463000Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2463164Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2463544Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2463747Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2463853Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2463948Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2464044Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2464045Z 2025-12-04T13:44:26.2464278Z [rank2]:[W1204 13:39:04.724325283 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2464448Z [rank1]:[W1204 13:39:05.340528071 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2464623Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2464877Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2465040Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2465405Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2465607Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2465713Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2465809Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2465904Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2465907Z 2025-12-04T13:44:26.2466139Z [rank1]:[W1204 13:39:05.342981527 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2466310Z [rank3]:[W1204 13:39:05.720167156 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2466500Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2466765Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2466938Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2467316Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2467563Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2467667Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2467764Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2467860Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2467863Z 2025-12-04T13:44:26.2468098Z [rank3]:[W1204 13:39:05.722050115 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2468267Z [rank2]:[W1204 13:39:05.724425973 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2468441Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2468696Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2468862Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2469232Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2469434Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2469539Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2469633Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2469733Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2469734Z 2025-12-04T13:44:26.2469967Z [rank2]:[W1204 13:39:05.727334779 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2470138Z [rank1]:[W1204 13:39:06.343090778 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2470312Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2470580Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2470770Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2471148Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2471350Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2471454Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2471548Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2471645Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2471648Z 2025-12-04T13:44:26.2471879Z [rank1]:[W1204 13:39:06.345575043 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2472049Z [rank3]:[W1204 13:39:06.722162236 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2472223Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2472485Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2472647Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2473013Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2473216Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2473320Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2473416Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2473512Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2473515Z 2025-12-04T13:44:26.2473749Z [rank3]:[W1204 13:39:06.724064864 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2473918Z [rank2]:[W1204 13:39:06.727414060 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2474093Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2474346Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2474521Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2474909Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2475120Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2475225Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2475321Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2475417Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2475419Z 2025-12-04T13:44:26.2475652Z [rank2]:[W1204 13:39:06.729237950 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2475823Z [rank1]:[W1204 13:39:07.345691343 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2475997Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2476251Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2476414Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2476785Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2476989Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2477093Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2477188Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2477284Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2477286Z 2025-12-04T13:44:26.2477570Z [rank1]:[W1204 13:39:07.347881515 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2477741Z [rank3]:[W1204 13:39:07.724221183 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2477915Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2478170Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2478346Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2478725Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2478943Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2479066Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2479163Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2479261Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2479262Z 2025-12-04T13:44:26.2479496Z [rank3]:[W1204 13:39:07.726499503 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2479667Z [rank2]:[W1204 13:39:07.729351851 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2479842Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2480099Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2480263Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2480631Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2480834Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2480939Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2481034Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2481134Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2481136Z 2025-12-04T13:44:26.2481374Z [rank2]:[W1204 13:39:07.731579592 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2481545Z [rank1]:[W1204 13:39:08.348077084 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2481721Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2481977Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2482140Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2482532Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2482743Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2482847Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2482953Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2483050Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2483054Z 2025-12-04T13:44:26.2483288Z [rank1]:[W1204 13:39:08.350418673 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2483460Z [rank3]:[W1204 13:39:08.726658753 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2483635Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2483891Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2484052Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2484420Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2484622Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2484725Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2484821Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2484916Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2484918Z 2025-12-04T13:44:26.2485153Z [rank3]:[W1204 13:39:08.728669849 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2485323Z [rank2]:[W1204 13:39:08.731694682 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2485499Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2485756Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2485919Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2486299Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2486512Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2486626Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2486722Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2486830Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2486832Z 2025-12-04T13:44:26.2487064Z [rank2]:[W1204 13:39:08.732836627 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2487237Z [rank1]:[W1204 13:39:09.350553533 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2487413Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2487723Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2487887Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2488252Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2488455Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2488561Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2488656Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2488754Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2488756Z 2025-12-04T13:44:26.2488988Z [rank1]:[W1204 13:39:09.353006909 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2489158Z [rank3]:[W1204 13:39:09.728808349 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2489332Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2489588Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2489751Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2490121Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2490351Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2490469Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2490565Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2490660Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2490662Z 2025-12-04T13:44:26.2490908Z [rank3]:[W1204 13:39:09.730681318 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2491078Z [rank2]:[W1204 13:39:09.732926729 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2491253Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2491512Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2491675Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2492049Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2492252Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2492357Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2492452Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2492550Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2492551Z 2025-12-04T13:44:26.2492784Z [rank2]:[W1204 13:39:09.734097713 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2492954Z [rank1]:[W1204 13:39:10.353118110 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2493130Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2493385Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2493549Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2493917Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2494131Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2494237Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2494368Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2494465Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2494467Z 2025-12-04T13:44:26.2494709Z [rank1]:[W1204 13:39:10.355327121 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2494880Z [rank3]:[W1204 13:39:10.730879257 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2495055Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2495312Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2495475Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2495843Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2496045Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2496148Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2496245Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2496342Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2496343Z 2025-12-04T13:44:26.2496580Z [rank3]:[W1204 13:39:10.733011640 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2496749Z [rank2]:[W1204 13:39:10.734203324 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2496926Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2497182Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2497344Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2497758Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2497960Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2498079Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2498173Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2498297Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2498299Z 2025-12-04T13:44:26.2498533Z [rank2]:[W1204 13:39:10.735863977 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2498718Z [rank1]:[W1204 13:39:11.355466671 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2498893Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2499148Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2499311Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2499677Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2499880Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2499985Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2500079Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2500177Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2500180Z 2025-12-04T13:44:26.2500411Z [rank1]:[W1204 13:39:11.357951777 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2500581Z [rank3]:[W1204 13:39:11.733188869 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2500755Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2501015Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2501178Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2501544Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2501746Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2501861Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2501957Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2502052Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2502064Z 2025-12-04T13:44:26.2502310Z [rank3]:[W1204 13:39:11.735578157 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2502481Z [rank2]:[W1204 13:39:11.735955049 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2502669Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2502929Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2503094Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2503465Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2503668Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2503773Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2503869Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2503968Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2503970Z 2025-12-04T13:44:26.2504203Z [rank2]:[W1204 13:39:11.737331418 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2504374Z [rank1]:[W1204 13:39:12.358063578 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2504550Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2504805Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2504968Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2505336Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2505539Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2505644Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2505752Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2505847Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2505849Z 2025-12-04T13:44:26.2506091Z [rank1]:[W1204 13:39:12.360423976 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2506274Z [rank3]:[W1204 13:39:12.735750946 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2506458Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2506717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2506879Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2507246Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2507451Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2507597Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2507694Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2507791Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2507793Z 2025-12-04T13:44:26.2508028Z [rank3]:[W1204 13:39:12.737056358 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2508200Z [rank2]:[W1204 13:39:12.737423060 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2508374Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2508629Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2508792Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2509162Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2509364Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2509470Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2509566Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2509663Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2509679Z 2025-12-04T13:44:26.2509914Z [rank2]:[W1204 13:39:12.738546345 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2510106Z [rank1]:[W1204 13:39:13.360559757 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2510295Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2510565Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2510728Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2511097Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2511299Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2511404Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2511499Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2511596Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2511599Z 2025-12-04T13:44:26.2511833Z [rank1]:[W1204 13:39:13.363056882 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2512007Z [rank3]:[W1204 13:39:13.737210078 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2512182Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2512438Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2512600Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2512967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2513170Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2513274Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2513369Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2513465Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2513466Z 2025-12-04T13:44:26.2513712Z [rank3]:[W1204 13:39:13.738511180 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2513893Z [rank2]:[W1204 13:39:13.738638307 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2514085Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2514340Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2514513Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2514885Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2515088Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2515195Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2515290Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2515388Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2515389Z 2025-12-04T13:44:26.2515626Z [rank2]:[W1204 13:39:13.739771052 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2515797Z [rank1]:[W1204 13:39:14.363165363 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2515973Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2516230Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2516395Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2516759Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2516963Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2517068Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2517163Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2517259Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2517262Z 2025-12-04T13:44:26.2517531Z [rank1]:[W1204 13:39:14.365705247 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2517715Z [rank3]:[W1204 13:39:14.738655610 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2517902Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2518174Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2518351Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2518720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2518922Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2519028Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2519124Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2519220Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2519222Z 2025-12-04T13:44:26.2519456Z [rank3]:[W1204 13:39:14.739885723 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2519628Z [rank2]:[W1204 13:39:14.739871943 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2519802Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2520057Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2520220Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2520591Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2520795Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2520901Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2520997Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2521095Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2521096Z 2025-12-04T13:44:26.2521331Z [rank2]:[W1204 13:39:14.741041468 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2521501Z [rank1]:[W1204 13:39:15.365818309 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2521689Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2521955Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2522129Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2522505Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2522709Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2522814Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2522912Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2523009Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2523010Z 2025-12-04T13:44:26.2523244Z [rank1]:[W1204 13:39:15.368247935 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2523414Z [rank3]:[W1204 13:39:15.740102432 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2523587Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2523844Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2524007Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2524372Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2524575Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2524679Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2524777Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2524874Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2524876Z 2025-12-04T13:44:26.2525111Z [rank3]:[W1204 13:39:15.741560150 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2525282Z [rank2]:[W1204 13:39:15.741381924 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2525477Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2525742Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2525915Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2526293Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2526494Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2526600Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2526695Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2526795Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2526797Z 2025-12-04T13:44:26.2527032Z [rank2]:[W1204 13:39:15.742578378 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2527203Z [rank1]:[W1204 13:39:16.368374636 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2527380Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2527677Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2527841Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2528211Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2528413Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2528519Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2528614Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2528712Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2528715Z 2025-12-04T13:44:26.2528948Z [rank1]:[W1204 13:39:16.370704145 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2529119Z [rank3]:[W1204 13:39:16.741739240 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2529293Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2529568Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2529745Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2530128Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2530342Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2530450Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2530546Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2530641Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2530643Z 2025-12-04T13:44:26.2530879Z [rank3]:[W1204 13:39:16.743878583 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2531049Z [rank2]:[W1204 13:39:16.742676669 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2531223Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2531479Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2531643Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2532015Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2532218Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2532323Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2532421Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2532517Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2532519Z 2025-12-04T13:44:26.2532753Z [rank2]:[W1204 13:39:16.744079989 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2532923Z [rank1]:[W1204 13:39:17.370836706 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2533100Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2533354Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2533533Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2533915Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2534138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2534244Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2534339Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2534435Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2534437Z 2025-12-04T13:44:26.2534670Z [rank1]:[W1204 13:39:17.373324842 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2534842Z [rank2]:[W1204 13:39:17.744164321 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2535016Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2535273Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2535437Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2535804Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2536011Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2536116Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2536212Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2536309Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2536311Z 2025-12-04T13:44:26.2536544Z [rank2]:[W1204 13:39:17.745394564 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2536716Z [rank3]:[W1204 13:39:17.744140071 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2536890Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2537146Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2537321Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2537736Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2537949Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2538079Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2538177Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2538274Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2538277Z 2025-12-04T13:44:26.2538514Z [rank3]:[W1204 13:39:17.746396212 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2538684Z [rank1]:[W1204 13:39:18.373457793 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2538860Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2539113Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2539275Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2539643Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2539844Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2539948Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2540044Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2540141Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2540143Z 2025-12-04T13:44:26.2540377Z [rank1]:[W1204 13:39:18.376263312 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2540550Z [rank2]:[W1204 13:39:18.745531385 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2540725Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2540983Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2541145Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2541526Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2541752Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2541856Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2541964Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2542060Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2542062Z 2025-12-04T13:44:26.2542298Z [rank2]:[W1204 13:39:18.747342045 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2542473Z [rank3]:[W1204 13:39:18.746535483 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2542650Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2542905Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2543070Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2543435Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2543636Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2543743Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2543839Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2543936Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2543938Z 2025-12-04T13:44:26.2544173Z [rank3]:[W1204 13:39:18.748307334 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2544343Z [rank1]:[W1204 13:39:19.376374363 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2544519Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2544776Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2544940Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2545305Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2545529Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2545646Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2545741Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2545837Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2545850Z 2025-12-04T13:44:26.2546084Z [rank1]:[W1204 13:39:19.379031985 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2546257Z [rank2]:[W1204 13:39:19.747497766 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2546432Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2546688Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2546854Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2547223Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2547426Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2547584Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2547681Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2547778Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2547780Z 2025-12-04T13:44:26.2548016Z [rank2]:[W1204 13:39:19.748987503 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2548191Z [rank3]:[W1204 13:39:19.748435306 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2548365Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2548623Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2548787Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2549158Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2549374Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2549491Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2549600Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2549695Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2549697Z 2025-12-04T13:44:26.2549943Z [rank3]:[W1204 13:39:19.751242984 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2550113Z [rank1]:[W1204 13:39:20.379121937 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2550290Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2550547Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2550711Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2551078Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2551282Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2551388Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2551484Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2551582Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2551584Z 2025-12-04T13:44:26.2551818Z [rank1]:[W1204 13:39:20.381570704 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2551863Z Autotune Choices Stats: 2025-12-04T13:44:26.2552245Z {"num_choices": 34, "num_triton_choices": 33, "best_kernel": "triton_mm_94", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.018400000408291817, "best_triton_pos": 0} 2025-12-04T13:44:26.2552292Z AUTOTUNE mm(2048x1024, 1024x1024) 2025-12-04T13:44:26.2552340Z strides: [1024, 1], [1024, 1] 2025-12-04T13:44:26.2552391Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T13:44:26.2552634Z triton_mm_94 0.0184 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2552672Z mm 0.0188 ms 97.7% 2025-12-04T13:44:26.2552907Z triton_mm_83 0.0188 ms 97.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2553152Z triton_mm_89 0.0196 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2553394Z triton_mm_93 0.0205 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2553646Z triton_mm_88 0.0209 ms 88.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2553887Z triton_mm_84 0.0210 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2554119Z triton_mm_92 0.0238 ms 77.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T13:44:26.2554346Z triton_mm_76 0.0240 ms 76.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2554578Z triton_mm_86 0.0243 ms 75.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T13:44:26.2554712Z SingleProcess AUTOTUNE benchmarking takes 24.3935 seconds and 0.2955 seconds precompiling for 34 choices 2025-12-04T13:44:26.2554888Z [rank2]:[W1204 13:39:20.749117705 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2555066Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2555323Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2555490Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2555862Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2556066Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2556172Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2556272Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2556368Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2556371Z 2025-12-04T13:44:26.2556608Z [rank2]:[W1204 13:39:20.751452864 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down 
too early), with error: Broken pipe 2025-12-04T13:44:26.2556780Z [rank3]:[W1204 13:39:20.751372885 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2556966Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2557234Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2557410Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2557849Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2558055Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2558160Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2558257Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2558354Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2558356Z 2025-12-04T13:44:26.2558591Z [rank3]:[W1204 13:39:20.752836063 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2558637Z Autotune Choices Stats: 2025-12-04T13:44:26.2559104Z {"num_choices": 34, "num_triton_choices": 33, "best_kernel": "mm", "best_time": 0.020958999171853065, "best_triton_pos": 1, 
"best_triton_time": 0.0247189998626709, "best_triton_kernel": "triton_mm_48", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8"} 2025-12-04T13:44:26.2559152Z AUTOTUNE mm(1024x2048, 2048x1024) 2025-12-04T13:44:26.2559195Z strides: [2048, 1], [1024, 1] 2025-12-04T13:44:26.2559246Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T13:44:26.2559287Z mm 0.0210 ms 100.0% 2025-12-04T13:44:26.2559528Z triton_mm_48 0.0247 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2559764Z triton_mm_45 0.0256 ms 81.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2559993Z triton_mm_47 0.0260 ms 80.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2560220Z triton_mm_40 0.0273 ms 76.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2560449Z triton_mm_43 0.0280 ms 74.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2560680Z triton_mm_58 0.0286 ms 73.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2560924Z triton_mm_53 0.0315 ms 66.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.2561167Z triton_mm_55 0.0340 ms 61.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2561420Z triton_mm_39 0.0358 ms 58.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.2561554Z SingleProcess AUTOTUNE benchmarking takes 25.2118 seconds and 0.0032 seconds precompiling for 34 choices 2025-12-04T13:44:26.2561727Z [rank1]:[W1204 13:39:21.381737824 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2561902Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2562163Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2562328Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2562697Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.2562900Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2563009Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2563106Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2563203Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2563204Z 2025-12-04T13:44:26.2563441Z [rank1]:[W1204 13:39:21.383853698 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2563612Z [rank2]:[W1204 13:39:21.751595335 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2563790Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2564048Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2564213Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2564579Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2564793Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2564898Z frame #4: + 0xdc253 
(0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2565004Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2565113Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2565116Z 2025-12-04T13:44:26.2565361Z [rank2]:[W1204 13:39:21.752783089 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2565534Z [rank3]:[W1204 13:39:21.752957615 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2565708Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2565966Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2566130Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2566498Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2566700Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2566805Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2566902Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2566999Z frame #6: + 0x1268c0 (0x719e15f168c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2567001Z 2025-12-04T13:44:26.2567238Z [rank3]:[W1204 13:39:21.754123230 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2567407Z [rank1]:[W1204 13:39:22.384281053 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2567649Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2567905Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2568071Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2568440Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2568641Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2568761Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2568856Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2568974Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2568990Z 2025-12-04T13:44:26.2569225Z [rank1]:[W1204 13:39:22.386490185 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2569409Z [rank2]:[W1204 13:39:22.752923160 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2569586Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2569841Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2570007Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2570375Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2570579Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2570686Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2570780Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2570878Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2570881Z 2025-12-04T13:44:26.2571115Z [rank2]:[W1204 13:39:22.754128704 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2571285Z [rank3]:[W1204 13:39:22.754229952 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.2571461Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2571717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2571879Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2572248Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2572453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2572558Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2572666Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2572763Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2572765Z 2025-12-04T13:44:26.2573010Z [rank3]:[W1204 13:39:22.755469385 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2573190Z [rank1]:[W1204 13:39:23.386598567 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2573377Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2573632Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2573794Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2574163Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2574366Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2574473Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2574569Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2574666Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2574668Z 2025-12-04T13:44:26.2574904Z [rank1]:[W1204 13:39:23.388999204 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2575073Z [rank2]:[W1204 13:39:23.754248346 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2575249Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2575503Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2575668Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2576035Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2576240Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2576346Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2576442Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2576550Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2576552Z 2025-12-04T13:44:26.2576795Z [rank2]:[W1204 13:39:23.755477629 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2576980Z [rank3]:[W1204 13:39:23.755597756 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2577154Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2577419Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2577632Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2578000Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2578204Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2578309Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2578406Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2578502Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2578504Z 2025-12-04T13:44:26.2578739Z [rank3]:[W1204 13:39:23.757189371 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2578910Z [rank1]:[W1204 13:39:24.389128216 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2579087Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2579348Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2579511Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2579879Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2580081Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2580186Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2580282Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2580379Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2580396Z 2025-12-04T13:44:26.2580630Z [rank1]:[W1204 13:39:24.391548493 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2580813Z [rank2]:[W1204 13:39:24.755619241 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2581002Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2581270Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2581436Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2581805Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2582008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2582115Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2582211Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2582308Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2582310Z 2025-12-04T13:44:26.2582545Z [rank2]:[W1204 13:39:24.756873433 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2582720Z [rank3]:[W1204 13:39:24.757310864 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2582895Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2583153Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2583315Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2583686Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2583889Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2583993Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2584090Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2584187Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2584189Z 2025-12-04T13:44:26.2584423Z [rank3]:[W1204 13:39:24.758489288 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2584613Z [rank1]:[W1204 13:39:25.391696065 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2584798Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2585066Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2585238Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2585606Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2585809Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2585916Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2586010Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2586111Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2586113Z 2025-12-04T13:44:26.2586346Z [rank1]:[W1204 13:39:25.393936345 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2586516Z [rank2]:[W1204 13:39:25.757022335 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2586692Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2586946Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2587111Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2587524Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2587729Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2587834Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2587929Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2588027Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2588029Z 2025-12-04T13:44:26.2588263Z [rank2]:[W1204 13:39:25.758327666 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2588448Z [rank3]:[W1204 13:39:25.758604820 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2588621Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2588904Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2589067Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2589447Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2589650Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2589755Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2589852Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2589947Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2589949Z 2025-12-04T13:44:26.2590185Z [rank3]:[W1204 13:39:25.759750365 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2590355Z [rank1]:[W1204 13:39:26.394065368 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2590531Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2590787Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2590952Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2591319Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2591521Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2591627Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2591726Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2591823Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2591825Z 2025-12-04T13:44:26.2592060Z [rank1]:[W1204 13:39:26.396523754 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2592230Z [rank2]:[W1204 13:39:26.758478298 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2592418Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2592682Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2592859Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2593237Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2593440Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2593546Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2593642Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2593739Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2593741Z 2025-12-04T13:44:26.2593974Z [rank2]:[W1204 13:39:26.759726301 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2594145Z [rank3]:[W1204 13:39:26.759860538 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2594319Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2594582Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2594746Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2595111Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2595313Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2595417Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2595513Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2595609Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2595612Z 2025-12-04T13:44:26.2595851Z [rank3]:[W1204 13:39:26.761016002 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2596023Z [rank1]:[W1204 13:39:27.396639586 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2596196Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2596463Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2596636Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2597029Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2597231Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2597338Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2597433Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2597574Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2597577Z 2025-12-04T13:44:26.2597812Z [rank1]:[W1204 13:39:27.398831278 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2597983Z [rank2]:[W1204 13:39:27.759870292 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2598162Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2598420Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2598587Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2598953Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2599156Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2599262Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2599357Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2599456Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2599459Z 2025-12-04T13:44:26.2599691Z [rank2]:[W1204 13:39:27.761095816 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2599862Z [rank3]:[W1204 13:39:27.761154644 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2600036Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2600294Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2600472Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2600864Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2601080Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2601185Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2601283Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2601379Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2601380Z 2025-12-04T13:44:26.2601615Z [rank3]:[W1204 13:39:27.763152621 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2601786Z [rank1]:[W1204 13:39:28.398937771 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2601960Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2602215Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2602378Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2602747Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2602949Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2603054Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2603153Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2603249Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2603251Z 2025-12-04T13:44:26.2603487Z [rank1]:[W1204 13:39:28.401436446 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2603656Z [rank2]:[W1204 13:39:28.761239038 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2603832Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2604086Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2604263Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2604645Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2604860Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2604977Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2605073Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2605171Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2605172Z 2025-12-04T13:44:26.2605407Z [rank2]:[W1204 13:39:28.762452231 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2605578Z [rank3]:[W1204 13:39:28.763269943 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2605752Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2606008Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2606171Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2606536Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2606740Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2606845Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2606943Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2607039Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2607041Z 2025-12-04T13:44:26.2607274Z [rank3]:[W1204 13:39:28.764505186 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2607446Z [rank1]:[W1204 13:39:29.401555629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2607656Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2607910Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2608072Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2608471Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2608685Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2608790Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2608907Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2609003Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2609006Z 2025-12-04T13:44:26.2609244Z [rank1]:[W1204 13:39:29.403992006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2609413Z [rank2]:[W1204 13:39:29.762600653 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2609589Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2609843Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2610007Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2610378Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2610579Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2610684Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2610780Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2610878Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2610881Z 2025-12-04T13:44:26.2611118Z [rank2]:[W1204 13:39:29.764307676 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2611292Z [rank3]:[W1204 13:39:29.764608809 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2611467Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2611725Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2611889Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2612266Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2612482Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2612603Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2612700Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2612805Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2612807Z 2025-12-04T13:44:26.2613042Z [rank3]:[W1204 13:39:29.765787843 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2613214Z [rank1]:[W1204 13:39:30.404106598 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2613390Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2613648Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2613811Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2614176Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2614377Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2614484Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2614580Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2614675Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2614678Z 2025-12-04T13:44:26.2614911Z [rank1]:[W1204 13:39:30.406577504 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2615082Z [rank2]:[W1204 13:39:30.764407869 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2615257Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2615516Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2615681Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2616049Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2616261Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2616390Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2616486Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2616585Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2616586Z 2025-12-04T13:44:26.2616832Z [rank2]:[W1204 13:39:30.765703281 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2617005Z [rank3]:[W1204 13:39:30.765901306 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2617178Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2617437Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2617646Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2618015Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2618218Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2618322Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2618422Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2618517Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2618519Z 2025-12-04T13:44:26.2618755Z [rank3]:[W1204 13:39:30.767058561 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2618928Z [rank1]:[W1204 13:39:31.406685067 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2619103Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2619360Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2619522Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2619890Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2620108Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2620214Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2620340Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2620435Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2620437Z 2025-12-04T13:44:26.2620684Z [rank1]:[W1204 13:39:31.409249431 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2620853Z [rank2]:[W1204 13:39:31.765827083 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2621028Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2621286Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2621451Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2621819Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2622020Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2622127Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2622224Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2622323Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2622325Z 2025-12-04T13:44:26.2622558Z [rank2]:[W1204 13:39:31.767845029 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2622729Z [rank3]:[W1204 13:39:31.767191153 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2622903Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2623162Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2623325Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2623690Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2623893Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2624009Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2624106Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2624223Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2624227Z 2025-12-04T13:44:26.2624461Z [rank3]:[W1204 13:39:31.768795158 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2624642Z [rank1]:[W1204 13:39:32.409363264 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2624817Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2625074Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2625237Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2625604Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2625806Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2625912Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2626008Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2626105Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2626108Z 2025-12-04T13:44:26.2626342Z [rank1]:[W1204 13:39:32.411067637 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2626515Z [rank2]:[W1204 13:39:32.767968742 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2626692Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2626948Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2627112Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2627536Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2627738Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2627859Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2627953Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2628050Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2628051Z 2025-12-04T13:44:26.2628324Z [rank2]:[W1204 13:39:32.769548717 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2628495Z [rank3]:[W1204 13:39:32.768968610 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2628685Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2628943Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2629107Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2629475Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2629677Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2629781Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2629879Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2629974Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2629978Z 2025-12-04T13:44:26.2630211Z [rank3]:[W1204 13:39:32.770738541 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2630385Z [rank1]:[W1204 13:39:33.411262218 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2630558Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2630812Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2630976Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2631348Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2631552Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2631657Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2631764Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2631859Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2631861Z 2025-12-04T13:44:26.2632105Z [rank1]:[W1204 13:39:33.413377102 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2632285Z [rank2]:[W1204 13:39:33.769676590 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2632460Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2632725Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2632892Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2633265Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2633471Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2633578Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2633673Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2633775Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2633777Z 2025-12-04T13:44:26.2634011Z [rank2]:[W1204 13:39:33.771983010 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2634184Z [rank3]:[W1204 13:39:33.770857254 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2634358Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2634613Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2634778Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2635143Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2635347Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2635451Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2635548Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2635643Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2635657Z 2025-12-04T13:44:26.2635890Z [rank3]:[W1204 13:39:33.773063606 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2636075Z [rank1]:[W1204 13:39:34.413494535 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2636261Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2636527Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2636690Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2637059Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2637260Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2637365Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2637461Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2637599Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2637601Z 2025-12-04T13:44:26.2637835Z [rank1]:[W1204 13:39:34.415897112 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2638006Z [rank2]:[W1204 13:39:34.772218900 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2638183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2638445Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2638611Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2638979Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2639183Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2639288Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2639383Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2645513Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2645517Z 2025-12-04T13:44:26.2645805Z [rank2]:[W1204 13:39:34.774973410 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2646005Z [rank3]:[W1204 13:39:34.773166280 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2646248Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2646517Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2646703Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2647080Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2647286Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2647395Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2647553Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2647650Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2647653Z 2025-12-04T13:44:26.2647891Z [rank3]:[W1204 13:39:34.775576547 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2648065Z [rank1]:[W1204 13:39:35.416040775 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2648242Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2648498Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2648661Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2649036Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2649244Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2649351Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2649447Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2649542Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2649545Z 2025-12-04T13:44:26.2649780Z [rank1]:[W1204 13:39:35.418512541 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2649980Z [rank2]:[W1204 13:39:35.775107263 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2650183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2650452Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2650629Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2650997Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2651201Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2651308Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2651404Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2651503Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2651505Z 2025-12-04T13:44:26.2651738Z [rank2]:[W1204 13:39:35.777500810 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2651910Z [rank3]:[W1204 13:39:35.775718370 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2652084Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2652340Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2652506Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2652874Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2653078Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2653184Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2653281Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2653377Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2653380Z 2025-12-04T13:44:26.2653615Z [rank3]:[W1204 13:39:35.778136036 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2653785Z [rank1]:[W1204 13:39:36.418687063 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2653973Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2654239Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2654413Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2654794Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2654996Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2655100Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2655198Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2655294Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2655295Z 2025-12-04T13:44:26.2655531Z [rank1]:[W1204 13:39:36.420856706 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2655701Z [rank2]:[W1204 13:39:36.777655743 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2655877Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2656132Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2656297Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2656666Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2656868Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2656973Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2657069Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2657167Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2657169Z 2025-12-04T13:44:26.2657402Z [rank2]:[W1204 13:39:36.780021161 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2657625Z [rank3]:[W1204 13:39:36.778276890 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2657816Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2658089Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2658265Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2658643Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2658845Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2658950Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2659047Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2659147Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2659149Z 2025-12-04T13:44:26.2659382Z [rank3]:[W1204 13:39:36.780513051 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2659553Z [rank1]:[W1204 13:39:37.420987799 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2659726Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2659984Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2660148Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2660519Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2660722Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2660827Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2660923Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2661020Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2661023Z 2025-12-04T13:44:26.2661256Z [rank1]:[W1204 13:39:37.423432945 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2661426Z [rank2]:[W1204 13:39:37.780168154 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2661604Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2661869Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2662044Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2662427Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2662645Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2662753Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2662847Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2662943Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2662945Z 2025-12-04T13:44:26.2663180Z [rank2]:[W1204 13:39:37.782537402 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2663351Z [rank3]:[W1204 13:39:37.780662543 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2663527Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2663781Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2663944Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2664311Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2664515Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2664621Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2664716Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2664812Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2664814Z 2025-12-04T13:44:26.2665047Z [rank3]:[W1204 13:39:37.783126439 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2665218Z [rank1]:[W1204 13:39:38.423571148 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2665393Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2665648Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2665820Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2666196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2666421Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2666525Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2666621Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2666718Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2666720Z 2025-12-04T13:44:26.2666954Z [rank1]:[W1204 13:39:38.425588314 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2667127Z [rank2]:[W1204 13:39:38.782684085 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2667302Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2667603Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2667767Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2668134Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2668335Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2668440Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2668535Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2668633Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2668635Z 2025-12-04T13:44:26.2668873Z [rank2]:[W1204 13:39:38.784734380 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2669045Z [rank3]:[W1204 13:39:38.783276422 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2669220Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2669477Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2669654Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2670033Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2670250Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2670366Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2670462Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2670558Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2670561Z 2025-12-04T13:44:26.2670793Z [rank3]:[W1204 13:39:38.785668490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2670964Z [rank1]:[W1204 13:39:39.425725788 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2671140Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2671395Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2671556Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2671925Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2672127Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2672231Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2672327Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2672422Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2672425Z 2025-12-04T13:44:26.2672661Z [rank1]:[W1204 13:39:39.428193523 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2672831Z [rank2]:[W1204 13:39:39.784876104 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2673006Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2673264Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2673426Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2673807Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2674029Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2674134Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2674229Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2674334Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2674336Z 2025-12-04T13:44:26.2674569Z [rank2]:[W1204 13:39:39.787383859 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2674740Z [rank3]:[W1204 13:39:39.785821063 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2674916Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2675172Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2675337Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2675706Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2675911Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2676016Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2676110Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2676208Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2676209Z 2025-12-04T13:44:26.2676442Z [rank3]:[W1204 13:39:39.787997205 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2676614Z [rank1]:[W1204 13:39:40.428368106 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2676789Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2677045Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2677208Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2677625Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2677857Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2677979Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2678074Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2678170Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2678187Z 2025-12-04T13:44:26.2678421Z [rank1]:[W1204 13:39:40.429815024 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2678593Z [rank2]:[W1204 13:39:40.787546592 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2678769Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2679025Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2679188Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2679555Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2679759Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2679867Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2679963Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2680064Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2680066Z 2025-12-04T13:44:26.2680300Z [rank2]:[W1204 13:39:40.789022209 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2680470Z [rank3]:[W1204 13:39:40.788145888 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2680644Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2680899Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2681064Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2681431Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2681645Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2681765Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2681871Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2681967Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2681969Z 2025-12-04T13:44:26.2682215Z [rank3]:[W1204 13:39:40.790319811 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2682388Z [rank1]:[W1204 13:39:41.429979497 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2682562Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2682819Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2682980Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2683347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2683549Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2683653Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2683749Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2683846Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2683848Z 2025-12-04T13:44:26.2684082Z [rank1]:[W1204 13:39:41.432445143 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2684254Z [rank2]:[W1204 13:39:41.789144303 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2684430Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2684685Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2684849Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2685216Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2685430Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2685534Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2685638Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2685747Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2685748Z 2025-12-04T13:44:26.2685983Z [rank2]:[W1204 13:39:41.791464962 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2686162Z [rank3]:[W1204 13:39:41.790453404 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2686340Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2686599Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2686762Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2687127Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2687329Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2687435Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2687550Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2687648Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2687651Z 2025-12-04T13:44:26.2687884Z [rank3]:[W1204 13:39:41.793087337 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2688054Z [rank1]:[W1204 13:39:42.432603196 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2688228Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2688487Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2688652Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2689021Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2689222Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2689349Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2689445Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2689539Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2689573Z 2025-12-04T13:44:26.2689806Z [rank1]:[W1204 13:39:42.434232341 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2689989Z [rank2]:[W1204 13:39:42.791567677 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2690165Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2690420Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2690583Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2690955Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2691157Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2691262Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2691357Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2691455Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2691456Z 2025-12-04T13:44:26.2691691Z [rank2]:[W1204 13:39:42.793707250 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2691860Z [rank3]:[W1204 13:39:42.793194521 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2692037Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2692292Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2692457Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2692823Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2693027Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2693131Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2693236Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2693332Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2693334Z 2025-12-04T13:44:26.2693586Z [rank3]:[W1204 13:39:42.795824133 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2693766Z [rank1]:[W1204 13:39:43.434425263 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2693950Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2694205Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2694367Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2694735Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2694938Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2695043Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2695138Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2695235Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2695236Z 2025-12-04T13:44:26.2695474Z [rank1]:[W1204 13:39:43.436550267 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2695643Z [rank2]:[W1204 13:39:43.793832134 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2695819Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2696075Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2696237Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2696604Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2696804Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2696909Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2697004Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2697112Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2697114Z 2025-12-04T13:44:26.2697353Z [rank2]:[W1204 13:39:43.796111894 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2697594Z [rank3]:[W1204 13:39:43.795970307 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2697769Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2698038Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2698202Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2698570Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2698774Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2698879Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2698974Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2699069Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2699072Z 2025-12-04T13:44:26.2699305Z [rank3]:[W1204 13:39:43.798418353 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2699477Z [rank1]:[W1204 13:39:44.436728500 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2699651Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2699906Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2700069Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2700438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2700641Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2700744Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2700839Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2700934Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2700952Z 2025-12-04T13:44:26.2701185Z [rank1]:[W1204 13:39:44.438946111 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2701363Z [rank2]:[W1204 13:39:44.796214538 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2701551Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2701818Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2701981Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2702349Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2702552Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2702656Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2702750Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2702848Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2702850Z 2025-12-04T13:44:26.2703084Z [rank2]:[W1204 13:39:44.798449789 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2703254Z [rank3]:[W1204 13:39:44.798524288 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2703429Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2703683Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2703848Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2704224Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2704427Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2704532Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2704627Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2704724Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2704726Z 2025-12-04T13:44:26.2704959Z [rank3]:[W1204 13:39:44.800723350 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2705141Z [rank1]:[W1204 13:39:45.439108364 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2705324Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2705591Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2705764Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2706132Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2706337Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2706441Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2706536Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2706631Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2706632Z 2025-12-04T13:44:26.2706866Z [rank1]:[W1204 13:39:45.440344787 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2707037Z [rank2]:[W1204 13:39:45.798613433 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2707211Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2707522Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2707685Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2708052Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2708255Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2708362Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2708457Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2708553Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2708555Z 2025-12-04T13:44:26.2708788Z [rank2]:[W1204 13:39:45.800474612 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2708976Z [rank3]:[W1204 13:39:45.800873703 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2709150Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2709418Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2709596Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2709979Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2710183Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2710291Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2710387Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2710484Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2710485Z 2025-12-04T13:44:26.2710720Z [rank3]:[W1204 13:39:45.802945808 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2710890Z [rank1]:[W1204 13:39:46.440517040 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2711065Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2711322Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2711487Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2711853Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2712055Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2712159Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2712256Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2712353Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2712355Z 2025-12-04T13:44:26.2712594Z [rank1]:[W1204 13:39:46.442892948 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2712769Z [rank2]:[W1204 13:39:46.800634485 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2712957Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2713224Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2713399Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2713781Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2713982Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2714088Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2714184Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2714281Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2714283Z 2025-12-04T13:44:26.2714515Z [rank2]:[W1204 13:39:46.802752639 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2714685Z [rank3]:[W1204 13:39:46.803368655 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2714863Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2715121Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2715287Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2715655Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2715856Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2715963Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2716057Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2716154Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2716156Z 2025-12-04T13:44:26.2716388Z [rank3]:[W1204 13:39:46.805985578 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2716559Z [rank1]:[W1204 13:39:47.443039882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2716732Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2717002Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2717177Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2717610Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2717813Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2717917Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2718013Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2718108Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2718111Z 2025-12-04T13:44:26.2718344Z [rank1]:[W1204 13:39:47.445402040 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2718514Z [rank2]:[W1204 13:39:47.802911372 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2718687Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2718943Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2719106Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2719477Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2719679Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2719785Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2719882Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2719978Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2719980Z 2025-12-04T13:44:26.2720214Z [rank2]:[W1204 13:39:47.805625363 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2720384Z [rank3]:[W1204 13:39:47.806085073 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2720559Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2720814Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2720990Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2721369Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2721601Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2721706Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2721801Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2721897Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2721899Z 2025-12-04T13:44:26.2722133Z [rank3]:[W1204 13:39:47.808401772 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2722304Z [rank1]:[W1204 13:39:48.445545814 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2722479Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2722733Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2722896Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2723261Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2723464Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2723567Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2723664Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2723760Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2723762Z 2025-12-04T13:44:26.2723996Z [rank1]:[W1204 13:39:48.446779627 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2724169Z [rank2]:[W1204 13:39:48.805744408 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2724342Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2724598Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2724771Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2725149Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2725361Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2725477Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2725574Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2725670Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2725672Z 2025-12-04T13:44:26.2725910Z [rank2]:[W1204 13:39:48.807952069 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2726081Z [rank3]:[W1204 13:39:48.808513937 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2726256Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2726511Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2726675Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2727042Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2727243Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2727348Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2727443Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2727586Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2727589Z 2025-12-04T13:44:26.2727821Z [rank3]:[W1204 13:39:48.810832026 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2727997Z [rank1]:[W1204 13:39:49.446934231 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2728172Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2728428Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2728591Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2728985Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2729200Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2729302Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2729411Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2729508Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2729510Z 2025-12-04T13:44:26.2729743Z [rank1]:[W1204 13:39:49.449181272 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2729914Z [rank2]:[W1204 13:39:49.808064704 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2730088Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2730346Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2730509Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2730878Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2731083Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2731185Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2731282Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2731378Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2731380Z 2025-12-04T13:44:26.2731613Z [rank2]:[W1204 13:39:49.810190557 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2731784Z [rank3]:[W1204 13:39:49.810972100 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2731960Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2732215Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2732378Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2732748Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2732970Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2733086Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2733180Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2733287Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2733289Z 2025-12-04T13:44:26.2733521Z [rank3]:[W1204 13:39:49.813131483 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2733692Z [rank1]:[W1204 13:39:50.449355865 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2733866Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2734121Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2734284Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2734654Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2734862Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2734966Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2735064Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2735158Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2735161Z 2025-12-04T13:44:26.2735394Z [rank1]:[W1204 13:39:50.451455639 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2735564Z [rank2]:[W1204 13:39:50.810301843 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2735738Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2735995Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2736156Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2736525Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2736743Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2736857Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2736974Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2737069Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2737071Z 2025-12-04T13:44:26.2737315Z [rank2]:[W1204 13:39:50.812512504 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2737532Z [rank3]:[W1204 13:39:50.813274127 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2737707Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2737962Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2738125Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2738493Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2738697Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2738802Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2738897Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2738993Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2738996Z 2025-12-04T13:44:26.2739232Z [rank3]:[W1204 13:39:50.815635876 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2739404Z [rank1]:[W1204 13:39:51.451623833 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2739579Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2739835Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2739997Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2740362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2740579Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2740682Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2740791Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2740900Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2740903Z 2025-12-04T13:44:26.2741151Z [rank1]:[W1204 13:39:51.453991411 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2741324Z [rank2]:[W1204 13:39:51.812583190 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2741500Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2741758Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2741922Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2742289Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2742493Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2742597Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2742692Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2742789Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2742791Z 2025-12-04T13:44:26.2743025Z [rank2]:[W1204 13:39:51.814750883 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2743195Z [rank3]:[W1204 13:39:51.815738621 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2743371Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2743630Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2743795Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2744161Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2744361Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2744475Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2744569Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2744677Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2744689Z 2025-12-04T13:44:26.2744920Z [rank3]:[W1204 13:39:51.817877324 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2745102Z [rank1]:[W1204 13:39:52.454116356 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2745277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2745534Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2745699Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2746068Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2746271Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2746375Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2746470Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2746565Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2746570Z 2025-12-04T13:44:26.2746802Z [rank1]:[W1204 13:39:52.456561463 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2746973Z [rank2]:[W1204 13:39:52.814902777 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2747146Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2747403Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2747605Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2747978Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2748181Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2748284Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2748393Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2748489Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2748490Z 2025-12-04T13:44:26.2748745Z [rank2]:[W1204 13:39:52.818367991 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2748928Z [rank3]:[W1204 13:39:52.818041158 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2749117Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2749373Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2749538Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2749908Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2750113Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2750219Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2750315Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2750412Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2750414Z 2025-12-04T13:44:26.2750646Z [rank3]:[W1204 13:39:52.820144432 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2750817Z [rank1]:[W1204 13:39:53.456717317 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2750993Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2751246Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2751411Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2751780Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2751985Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2752090Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2752186Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2752294Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2752296Z 2025-12-04T13:44:26.2752546Z [rank1]:[W1204 13:39:53.458754332 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2752728Z [rank2]:[W1204 13:39:53.818491666 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2752901Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2753167Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2753331Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2753700Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2753902Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2754006Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2754104Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2754200Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2754202Z 2025-12-04T13:44:26.2754439Z [rank2]:[W1204 13:39:53.820557791 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2754609Z [rank3]:[W1204 13:39:53.820272227 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2754786Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2755041Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2755205Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2755572Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2755772Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2755877Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2755973Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2756069Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2756083Z 2025-12-04T13:44:26.2756315Z [rank3]:[W1204 13:39:53.822494498 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2756499Z [rank1]:[W1204 13:39:54.458920407 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2756685Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2756950Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2757113Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2757517Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2757720Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2757823Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2757919Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2758016Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2758018Z 2025-12-04T13:44:26.2758251Z [rank1]:[W1204 13:39:54.460969442 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2758423Z [rank2]:[W1204 13:39:54.820658287 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2758598Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2758854Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2759016Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2759385Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2759589Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2759693Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2759788Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2759884Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2759886Z 2025-12-04T13:44:26.2760120Z [rank2]:[W1204 13:39:54.822334730 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2760307Z [rank3]:[W1204 13:39:54.822609734 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2760495Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2760765Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2760942Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2761313Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2761517Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2761624Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2761718Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2761815Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2761817Z 2025-12-04T13:44:26.2762050Z [rank3]:[W1204 13:39:54.824933893 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2762220Z [rank1]:[W1204 13:39:55.461113287 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2762396Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2762651Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2762813Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2763181Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2763383Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2763488Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2763584Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2763681Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2763683Z 2025-12-04T13:44:26.2763917Z [rank1]:[W1204 13:39:55.463225100 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2764098Z [rank2]:[W1204 13:39:55.822459115 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2764272Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2764549Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2764712Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2765092Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2765296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2765401Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2765502Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2765597Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2765599Z 2025-12-04T13:44:26.2765833Z [rank2]:[W1204 13:39:55.824007491 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2766003Z [rank3]:[W1204 13:39:55.825050728 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2766182Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2766438Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2766602Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2766971Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2767172Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2767276Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2767372Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2767468Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2767509Z 2025-12-04T13:44:26.2767744Z [rank3]:[W1204 13:39:55.827219061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2767916Z [rank1]:[W1204 13:39:56.463374495 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2768110Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2768378Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2768554Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2768936Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2769137Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2769242Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2769337Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2769436Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2769437Z 2025-12-04T13:44:26.2769674Z [rank1]:[W1204 13:39:56.465282323 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2769845Z [rank2]:[W1204 13:39:56.824112847 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2770020Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2770277Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2770442Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2770810Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2771013Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2771118Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2771215Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2771312Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2771314Z 2025-12-04T13:44:26.2771548Z [rank2]:[W1204 13:39:56.825988646 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2771718Z [rank3]:[W1204 13:39:56.827329197 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2771895Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2772163Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2772347Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2772725Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2772927Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2773033Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2773128Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2773226Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2773229Z 2025-12-04T13:44:26.2773463Z [rank3]:[W1204 13:39:56.829575317 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2773634Z [rank1]:[W1204 13:39:57.465425088 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2773810Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2774070Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2774236Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2774605Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2774808Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2774915Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2775009Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2775106Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2775108Z 2025-12-04T13:44:26.2775341Z [rank1]:[W1204 13:39:57.467209789 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2775511Z [rank2]:[W1204 13:39:57.826111321 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2775686Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2775942Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2776117Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2776510Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2776726Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2776830Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2776927Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2777023Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2777025Z 2025-12-04T13:44:26.2777260Z [rank2]:[W1204 13:39:57.828536058 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2777430Z [rank3]:[W1204 13:39:57.829687063 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2777637Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2777893Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2778056Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2778424Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2778629Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2778733Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2778829Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2778926Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2778928Z 2025-12-04T13:44:26.2779163Z [rank3]:[W1204 13:39:57.832029392 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2779334Z [rank1]:[W1204 13:39:58.467364304 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2779510Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2779764Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2779947Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2780328Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2780543Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2780666Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2780762Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2780860Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2780863Z 2025-12-04T13:44:26.2781096Z [rank1]:[W1204 13:39:58.469448598 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2781268Z [rank2]:[W1204 13:39:58.828662434 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2781442Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2781700Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2781863Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2782231Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2782434Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2782538Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2782634Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2782731Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2782733Z 2025-12-04T13:44:26.2782969Z [rank2]:[W1204 13:39:58.830791947 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2783141Z [rank3]:[W1204 13:39:58.832171937 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2783318Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2783576Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2783737Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2784128Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2784340Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2784446Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2784553Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2784651Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2784654Z 2025-12-04T13:44:26.2784887Z [rank3]:[W1204 13:39:58.834472897 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2785060Z [rank1]:[W1204 13:39:59.469873388 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2785238Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2785493Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2785655Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2786023Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2786226Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2786332Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2786428Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2786524Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2786526Z 2025-12-04T13:44:26.2786759Z [rank1]:[W1204 13:39:59.472281035 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2786930Z [rank2]:[W1204 13:39:59.830916473 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2787105Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2787362Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2787568Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2787951Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2788168Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2788284Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2788380Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2788491Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2788493Z 2025-12-04T13:44:26.2788727Z [rank2]:[W1204 13:39:59.832958698 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2788899Z [rank3]:[W1204 13:39:59.834575793 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2789075Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2789334Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2789496Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2789866Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2790069Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2790174Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2790269Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2790367Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2790369Z 2025-12-04T13:44:26.2790603Z [rank3]:[W1204 13:39:59.836937881 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2790773Z [rank1]:[W1204 13:40:00.472438570 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2790949Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2791203Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2791366Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2791737Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2791961Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2792076Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2792170Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2792266Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2792268Z 2025-12-04T13:44:26.2792510Z [rank1]:[W1204 13:40:00.474438026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2792683Z [rank2]:[W1204 13:40:00.833093074 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2792857Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2793115Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2793279Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2793646Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2793849Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2793954Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2794052Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2794147Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2794149Z 2025-12-04T13:44:26.2794384Z [rank2]:[W1204 13:40:00.834538872 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2794553Z [rank3]:[W1204 13:40:00.837287692 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2794728Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2794986Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2795148Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2795515Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2795728Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2795833Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2795954Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2796051Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2796053Z 2025-12-04T13:44:26.2796300Z [rank3]:[W1204 13:40:00.839468344 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2796470Z [rank1]:[W1204 13:40:01.474589641 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2796647Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2796902Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2797066Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2797435Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2797680Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2797784Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2797879Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2797977Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2797978Z 2025-12-04T13:44:26.2798213Z [rank1]:[W1204 13:40:01.477028108 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2798386Z [rank2]:[W1204 13:40:01.834702367 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2798562Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2798821Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2798990Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2799356Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2799558Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2799677Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2799773Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2799895Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2799897Z 2025-12-04T13:44:26.2800130Z [rank2]:[W1204 13:40:01.836855750 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2800314Z [rank3]:[W1204 13:40:01.839601390 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2800489Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2800752Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2800916Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2801283Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2801483Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2801590Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2801685Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2801781Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2801784Z 2025-12-04T13:44:26.2802016Z [rank3]:[W1204 13:40:01.841763282 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2802186Z [rank1]:[W1204 13:40:02.477171503 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2802361Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2802617Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2802784Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2803150Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2803354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2803468Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2803563Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2803659Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2803661Z 2025-12-04T13:44:26.2803914Z [rank1]:[W1204 13:40:02.478416986 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2804086Z [rank2]:[W1204 13:40:02.837015405 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2804270Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2804527Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2804691Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2805062Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2805265Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2805368Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2805465Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2805561Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2805563Z 2025-12-04T13:44:26.2805797Z [rank2]:[W1204 13:40:02.838320717 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2805969Z [rank3]:[W1204 13:40:02.841904158 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2806143Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2806401Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2806562Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2806932Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2807137Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2807242Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2807348Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2807443Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2807445Z 2025-12-04T13:44:26.2807731Z [rank3]:[W1204 13:40:02.843071082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2807915Z [rank1]:[W1204 13:40:03.478564572 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2808103Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2808357Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2808521Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2808888Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2809090Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2809196Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2809291Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2809388Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2809390Z 2025-12-04T13:44:26.2809623Z [rank1]:[W1204 13:40:03.479841944 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2809795Z [rank2]:[W1204 13:40:03.838476962 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2809969Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2810225Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2810389Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2810758Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2810961Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2811064Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2811162Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2811259Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2811274Z 2025-12-04T13:44:26.2811510Z [rank2]:[W1204 13:40:03.839697125 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2811696Z [rank3]:[W1204 13:40:03.843180829 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2811883Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2812149Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2812312Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2812681Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2812882Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2812987Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2813083Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2813179Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2813181Z 2025-12-04T13:44:26.2813414Z [rank3]:[W1204 13:40:03.844340663 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2813585Z [rank1]:[W1204 13:40:04.479988129 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2813761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2814016Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2814178Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2814545Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2814746Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2814851Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2814946Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2815042Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2815045Z 2025-12-04T13:44:26.2815286Z [rank1]:[W1204 13:40:04.481382629 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2815456Z [rank2]:[W1204 13:40:04.839885850 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2815651Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2815910Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2816086Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2816453Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2816657Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2816760Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2816857Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2816953Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2816955Z 2025-12-04T13:44:26.2817187Z [rank2]:[W1204 13:40:04.841754839 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2817358Z [rank3]:[W1204 13:40:04.844471920 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2817574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2817831Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2817996Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2818363Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2818565Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2818670Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2818765Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2818860Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2818862Z 2025-12-04T13:44:26.2819095Z [rank3]:[W1204 13:40:04.845777711 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2819281Z [rank1]:[W1204 13:40:05.481562434 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2819469Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2819737Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2819913Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2820286Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2820490Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2820595Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2820689Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2820786Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2820788Z 2025-12-04T13:44:26.2821019Z [rank1]:[W1204 13:40:05.483159789 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2821191Z [rank2]:[W1204 13:40:05.841929174 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2821364Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2821620Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2821784Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2822151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2822355Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2822460Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2822558Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2822657Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2822659Z 2025-12-04T13:44:26.2822894Z [rank2]:[W1204 13:40:05.843859872 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2823066Z [rank3]:[W1204 13:40:05.845898777 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2823252Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2823519Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2823692Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2824069Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2824271Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2824375Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2824474Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2824570Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2824572Z 2025-12-04T13:44:26.2824807Z [rank3]:[W1204 13:40:05.847251668 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2824976Z [rank1]:[W1204 13:40:06.483332334 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2825150Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2825403Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2825568Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2825934Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2826136Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2826242Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2826339Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2826436Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2826438Z 2025-12-04T13:44:26.2826674Z [rank1]:[W1204 13:40:06.485539076 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2826847Z [rank2]:[W1204 13:40:06.843997708 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2827038Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2827308Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2827511Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2827893Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2828096Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2828200Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2828296Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2828393Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2828397Z 2025-12-04T13:44:26.2828628Z [rank2]:[W1204 13:40:06.845266170 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2828798Z [rank3]:[W1204 13:40:06.847375374 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2828976Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2829236Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2829397Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2829767Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2829968Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2830073Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2830168Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2830264Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2830267Z 2025-12-04T13:44:26.2830501Z [rank3]:[W1204 13:40:06.848524879 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2830670Z [rank1]:[W1204 13:40:07.485635113 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2830845Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2831115Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2831293Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2831674Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2831885Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2831995Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2832090Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2832186Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2832188Z 2025-12-04T13:44:26.2832421Z [rank1]:[W1204 13:40:07.488103799 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2832592Z [rank2]:[W1204 13:40:07.845417446 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2832767Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2833021Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2833187Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2833556Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2833762Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2833866Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2833964Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2834059Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2834063Z 2025-12-04T13:44:26.2834299Z [rank2]:[W1204 13:40:07.847582979 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2834473Z [rank3]:[W1204 13:40:07.848635656 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2834647Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2834905Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2835078Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2835461Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2835689Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2835794Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2835893Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2835990Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2835992Z 2025-12-04T13:44:26.2836228Z [rank3]:[W1204 13:40:07.849869199 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2836398Z [rank1]:[W1204 13:40:08.488259875 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2836573Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2836828Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2836991Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2837357Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2837606Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2837712Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2837808Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2837905Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2837907Z 2025-12-04T13:44:26.2838141Z [rank1]:[W1204 13:40:08.490053055 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2838312Z [rank2]:[W1204 13:40:08.847741045 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2838489Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2838745Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2838923Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2839304Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2839519Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2839641Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2839738Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2839835Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2839838Z 2025-12-04T13:44:26.2840072Z [rank2]:[W1204 13:40:08.850071584 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2840245Z [rank3]:[W1204 13:40:08.850047334 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2840421Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2840679Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2840841Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2841212Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2841415Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2841519Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2841618Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2841713Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2841715Z 2025-12-04T13:44:26.2841948Z [rank3]:[W1204 13:40:08.852388803 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2842120Z [rank1]:[W1204 13:40:09.490180122 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2842298Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2842553Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2842717Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2843095Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2843317Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2843423Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2843518Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2843624Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2843626Z 2025-12-04T13:44:26.2843859Z [rank1]:[W1204 13:40:09.491947143 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2844030Z [rank2]:[W1204 13:40:09.850209960 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2844209Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2844472Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2844638Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2845006Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2845211Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2845316Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2845413Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2845510Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2845512Z 2025-12-04T13:44:26.2845745Z [rank2]:[W1204 13:40:09.852283865 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2845917Z [rank3]:[W1204 13:40:09.852493170 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2846091Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2846349Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2846514Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2846881Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2847105Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2847221Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2847317Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2847412Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2847425Z 2025-12-04T13:44:26.2847698Z [rank3]:[W1204 13:40:09.854938986 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2847869Z [rank1]:[W1204 13:40:10.492066250 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2848047Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2848303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2848471Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2848844Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2849048Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2849154Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2849250Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2849347Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2849349Z 2025-12-04T13:44:26.2849584Z [rank1]:[W1204 13:40:10.493311063 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2849757Z [rank2]:[W1204 13:40:10.852456940 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2849932Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2850188Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2850351Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2850717Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2850936Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2851052Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2851163Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2851260Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2851262Z 2025-12-04T13:44:26.2851506Z [rank2]:[W1204 13:40:10.854341179 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2851679Z [rank3]:[W1204 13:40:10.855083893 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2851854Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2852115Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2852280Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2852647Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2852851Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2852955Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2853055Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2853152Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2853153Z 2025-12-04T13:44:26.2853389Z [rank3]:[W1204 13:40:10.857364253 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2853558Z [rank1]:[W1204 13:40:11.493457119 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2853733Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2853988Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2854151Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2854520Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2854733Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2854840Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2854946Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2855063Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2855065Z 2025-12-04T13:44:26.2855301Z [rank1]:[W1204 13:40:11.494698372 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2855485Z [rank2]:[W1204 13:40:11.854563324 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2855666Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2855919Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2856084Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2856452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2856655Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2856760Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2856858Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2856956Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2856959Z 2025-12-04T13:44:26.2857194Z [rank2]:[W1204 13:40:11.856104290 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2857370Z [rank3]:[W1204 13:40:11.857483980 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2857595Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2857855Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2858017Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2858388Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2858592Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2858709Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2858804Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2858915Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2858930Z 2025-12-04T13:44:26.2859165Z [rank3]:[W1204 13:40:11.859482786 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2859347Z [rank1]:[W1204 13:40:12.494849529 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2859524Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2859787Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2859952Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2860325Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2860527Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2860632Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2860727Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2860824Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2860825Z 2025-12-04T13:44:26.2861060Z [rank1]:[W1204 13:40:12.497273026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2861232Z [rank2]:[W1204 13:40:12.856251447 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2861408Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2861664Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2861829Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2862202Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2862407Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2862514Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2862621Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2862719Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2862721Z 2025-12-04T13:44:26.2862963Z [rank2]:[W1204 13:40:12.857477980 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2863145Z [rank3]:[W1204 13:40:12.859603463 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2863329Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2863588Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2863750Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2864121Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2864328Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2864432Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2864528Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2864625Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2864627Z 2025-12-04T13:44:26.2864860Z [rank3]:[W1204 13:40:12.860777007 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2865031Z [rank1]:[W1204 13:40:13.497469961 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2865207Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2865463Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2865626Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2865992Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2866196Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2866304Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2866400Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2866510Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2866512Z 2025-12-04T13:44:26.2866748Z [rank1]:[W1204 13:40:13.499974626 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2866940Z [rank2]:[W1204 13:40:13.857612857 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2867117Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2867390Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2867600Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2867967Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2868171Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2868276Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2868371Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2868470Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2868472Z 2025-12-04T13:44:26.2868707Z [rank2]:[W1204 13:40:13.858856459 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2868879Z [rank3]:[W1204 13:40:13.860910134 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2869054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2869317Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2869479Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2869847Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2870050Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2870153Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2870251Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2870346Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2870361Z 2025-12-04T13:44:26.2870597Z [rank3]:[W1204 13:40:13.862080539 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2870643Z Result from 3 is 21102592.0 2025-12-04T13:44:26.2870687Z Result from 0 is 21102592.0 2025-12-04T13:44:26.2870761Z Result from 2 is 21102592.0 2025-12-04T13:44:26.2870804Z Result from 1 is 21102592.0 2025-12-04T13:44:26.2870977Z [rank1]:[W1204 13:40:14.500137543 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2871152Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2871426Z 
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2871590Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2871959Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2872161Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2872271Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2872367Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2872466Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2872468Z 2025-12-04T13:44:26.2872704Z [rank1]:[W1204 13:40:14.501441184 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2872879Z [rank2]:[W1204 13:40:14.859070265 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2873056Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2873316Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2873481Z frame #1: + 0x6eb6c0e 
(0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2873849Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2874051Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2874158Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2874254Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2874353Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2874365Z 2025-12-04T13:44:26.2874597Z [rank2]:[W1204 13:40:14.861327135 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2874780Z [rank3]:[W1204 13:40:14.862204006 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2874965Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2875233Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2875398Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2875767Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 
(0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2875971Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2876075Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2876172Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2876267Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2876270Z 2025-12-04T13:44:26.2876504Z [rank3]:[W1204 13:40:14.863459548 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2876675Z [rank1]:[W1204 13:40:15.501625620 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2876852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2877108Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2877271Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2877720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2877924Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2878029Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2878127Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2878225Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2878226Z 2025-12-04T13:44:26.2878478Z [rank1]:[W1204 13:40:15.503525878 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2878647Z [rank2]:[W1204 13:40:15.861503591 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2878850Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2879105Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2879283Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2879650Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2879861Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2879969Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2880065Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2880164Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2880165Z 2025-12-04T13:44:26.2880399Z [rank2]:[W1204 13:40:15.863544567 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2880571Z [rank3]:[W1204 13:40:15.863609665 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2880747Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2881004Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2881166Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2881535Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2881739Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2881843Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2881940Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2882036Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2882038Z 2025-12-04T13:44:26.2882271Z [rank3]:[W1204 13:40:15.865313168 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2882456Z [rank1]:[W1204 13:40:16.503647176 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2882641Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2882911Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2883083Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2883452Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2883654Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2883761Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2883856Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2883952Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2883955Z 2025-12-04T13:44:26.2884194Z [rank1]:[W1204 13:40:16.505926806 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2884239Z PASSED [272.6694s] [ 33%] 2025-12-04T13:44:26.2884532Z 
distributed/test_dynamo_distributed.py::TestMultiProc::test_multiproc_autotune_dynamic_shapes I1204 13:40:16.731000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 100469 2025-12-04T13:44:26.2884686Z I1204 13:40:16.732000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 100470 2025-12-04T13:44:26.2884839Z I1204 13:40:16.733000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 100471 2025-12-04T13:44:26.2884988Z I1204 13:40:16.733000 67577 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 100472 2025-12-04T13:44:26.2885160Z [rank2]:[W1204 13:40:16.863695044 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2885335Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2885592Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2885758Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2886125Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2886330Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2886455Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2886553Z frame #5: + 
0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2886665Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2886683Z 2025-12-04T13:44:26.2886917Z [rank2]:[W1204 13:40:16.864922817 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2887096Z [rank3]:[W1204 13:40:16.865440885 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2887271Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2887568Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2887730Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2888099Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2888301Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2888406Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2888502Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2888599Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2888603Z 2025-12-04T13:44:26.2888844Z [rank3]:[W1204 13:40:16.866592670 
ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2889014Z [rank1]:[W1204 13:40:17.506062943 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2889190Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2889444Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2889610Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2889980Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2890183Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2890288Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2890396Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2890493Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2890495Z 2025-12-04T13:44:26.2890741Z [rank1]:[W1204 13:40:17.508427881 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2890929Z [rank2]:[W1204 13:40:17.865057694 
TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2891120Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2891379Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2891544Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2891914Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2892118Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2892222Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2892322Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2892418Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2892420Z 2025-12-04T13:44:26.2892656Z [rank2]:[W1204 13:40:17.867199217 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2897014Z [rank3]:[W1204 13:40:17.866708938 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2897208Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2897519Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2897686Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2898058Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2898262Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2898370Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2898465Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2898592Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2898594Z 2025-12-04T13:44:26.2898846Z [rank3]:[W1204 13:40:17.868637245 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2899032Z [rank1]:[W1204 13:40:18.508604578 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2899208Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2899477Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2899646Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2900017Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2900220Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2900328Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2900423Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2900521Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2900523Z 2025-12-04T13:44:26.2900755Z [rank1]:[W1204 13:40:18.510848548 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2900931Z [rank2]:[W1204 13:40:18.867317305 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2901107Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2901365Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2901529Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2901901Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2902105Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2902208Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2902305Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2902402Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2902415Z 2025-12-04T13:44:26.2902649Z [rank2]:[W1204 13:40:18.869598595 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2902830Z [rank3]:[W1204 13:40:18.868733834 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2903017Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2903289Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2903453Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2903823Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2904026Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2904131Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2904227Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2904323Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2904325Z 2025-12-04T13:44:26.2904561Z [rank3]:[W1204 13:40:18.871149121 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2904730Z [rank1]:[W1204 13:40:19.511040315 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2904905Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2905161Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2905326Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2905692Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2905895Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2906000Z frame #4: + 0xdc253 (0x7e1038e48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2906095Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2906193Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2906196Z 2025-12-04T13:44:26.2906429Z [rank1]:[W1204 13:40:19.512631590 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2906612Z [rank2]:[W1204 13:40:19.869748332 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2906799Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2907066Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2907238Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2907651Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2907855Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2907961Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2908057Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2908155Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.2908157Z 2025-12-04T13:44:26.2908398Z [rank2]:[W1204 13:40:19.870966135 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2908570Z [rank3]:[W1204 13:40:19.871288328 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2908745Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2909003Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2909166Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2909533Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2909736Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2909842Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2909938Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2910033Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2910035Z 2025-12-04T13:44:26.2910269Z [rank3]:[W1204 13:40:19.873101318 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.2910455Z [rank1]:[W1204 13:40:20.512752688 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2910631Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2910911Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2911075Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2911458Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2911660Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2911766Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2911862Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2911959Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2911960Z 2025-12-04T13:44:26.2912193Z [rank1]:[W1204 13:40:20.514769563 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2912363Z [rank2]:[W1204 13:40:20.871102113 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2912538Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2912796Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2912961Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2913328Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2913533Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2913636Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2913733Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2913830Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2913831Z 2025-12-04T13:44:26.2914068Z [rank2]:[W1204 13:40:20.872534441 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2914238Z [rank3]:[W1204 13:40:20.873237236 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2914424Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2914693Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2914867Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2915250Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2915452Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2915557Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2915654Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2915750Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2915752Z 2025-12-04T13:44:26.2915987Z [rank3]:[W1204 13:40:20.875605934 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2916155Z [rank1]:[W1204 13:40:21.514906071 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2916330Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2916584Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2916748Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.2917119Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2917323Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2917428Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2917559Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2917658Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2917660Z 2025-12-04T13:44:26.2917892Z [rank1]:[W1204 13:40:21.516919287 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2918063Z [rank2]:[W1204 13:40:21.872672349 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2918237Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2918506Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2918682Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2919087Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.2919290Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2919394Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2919490Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2919587Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2919590Z 2025-12-04T13:44:26.2919823Z [rank2]:[W1204 13:40:21.873885842 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2919993Z [rank3]:[W1204 13:40:21.875756031 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2920167Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2920425Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2920587Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2920954Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2921157Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2921262Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2921358Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2921454Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2921457Z 2025-12-04T13:44:26.2921695Z [rank3]:[W1204 13:40:21.877895144 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2921864Z [rank1]:[W1204 13:40:22.517115073 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2922039Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2922294Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2922468Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2922858Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2923070Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2923175Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2923270Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2923367Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2923368Z 2025-12-04T13:44:26.2923602Z [rank1]:[W1204 13:40:22.518937753 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2923775Z [rank2]:[W1204 13:40:22.874029630 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2923949Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2924204Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2924370Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2924742Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2924946Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2925050Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2925147Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2925243Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2925245Z 2025-12-04T13:44:26.2925478Z [rank2]:[W1204 13:40:22.875246643 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2925652Z [rank3]:[W1204 13:40:22.878071481 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2925826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2926084Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2926258Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2926634Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2926846Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2926960Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2927056Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2927152Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2927154Z 2025-12-04T13:44:26.2927388Z [rank3]:[W1204 13:40:22.879631487 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2927596Z [rank1]:[W1204 13:40:23.519119000 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.2927770Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2928023Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2928188Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2928559Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2928762Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2928866Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2928961Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2929056Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2929059Z 2025-12-04T13:44:26.2929291Z [rank1]:[W1204 13:40:23.521209894 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2929463Z [rank2]:[W1204 13:40:23.875436610 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2929637Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2929891Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2930054Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2930454Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2930669Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2930772Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2930881Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2930977Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2930982Z 2025-12-04T13:44:26.2931214Z [rank2]:[W1204 13:40:23.877027565 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2931385Z [rank3]:[W1204 13:40:23.879774765 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2931564Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2931822Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2931985Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2932353Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2932556Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2932658Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2932755Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2932849Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2932851Z 2025-12-04T13:44:26.2933086Z [rank3]:[W1204 13:40:23.882134693 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2933257Z [rank1]:[W1204 13:40:24.521330013 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2933432Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2933685Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2933848Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2934232Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2934452Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2934569Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2934663Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2934770Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2934772Z 2025-12-04T13:44:26.2935005Z [rank1]:[W1204 13:40:24.523720710 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2935176Z [rank2]:[W1204 13:40:24.877159413 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2935353Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2935607Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2935772Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2936145Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2936354Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2936458Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2936553Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2936648Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2936653Z 2025-12-04T13:44:26.2936885Z [rank2]:[W1204 13:40:24.879175549 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2937056Z [rank3]:[W1204 13:40:24.882239232 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2937230Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2937529Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2937692Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2938059Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2938274Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2938404Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2938500Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2938596Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2938597Z 2025-12-04T13:44:26.2938845Z [rank3]:[W1204 13:40:24.884554461 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2939016Z [rank1]:[W1204 13:40:25.523894547 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2939190Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2939444Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2939608Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2939977Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2940177Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2940283Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2940378Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2940475Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2940476Z 2025-12-04T13:44:26.2940712Z [rank1]:[W1204 13:40:25.526036530 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2940884Z [rank2]:[W1204 13:40:25.879340926 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2941059Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2941316Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2941480Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2941845Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2942059Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2942163Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2942279Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2942375Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2942378Z 2025-12-04T13:44:26.2942619Z [rank2]:[W1204 13:40:25.881163346 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2942793Z [rank3]:[W1204 13:40:25.884750338 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2942968Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2943223Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2943386Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2943752Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2943954Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2944058Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2944155Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2944251Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2944253Z 2025-12-04T13:44:26.2944486Z [rank3]:[W1204 13:40:25.886924660 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2944655Z [rank1]:[W1204 13:40:26.526283006 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2944830Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2945087Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2945250Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2945616Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2945816Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2945933Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2946028Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2946148Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2946150Z 2025-12-04T13:44:26.2946383Z [rank1]:[W1204 13:40:26.528516297 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2946565Z [rank2]:[W1204 13:40:26.881299464 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2946740Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2946995Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2947159Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2947568Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2947771Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2947875Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2947971Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2948069Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2948072Z 2025-12-04T13:44:26.2948303Z [rank2]:[W1204 13:40:26.882708994 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2948475Z [rank3]:[W1204 13:40:26.887065478 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2948649Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2948906Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2949068Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2949437Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2949639Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2949764Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2949858Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2949953Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2949955Z 2025-12-04T13:44:26.2950215Z [rank3]:[W1204 13:40:26.889360528 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2950385Z [rank1]:[W1204 13:40:27.528693264 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2950576Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2950830Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2950993Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2951360Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2951563Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2951668Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2951763Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2951859Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2951861Z 2025-12-04T13:44:26.2952094Z [rank1]:[W1204 13:40:27.531041533 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2952267Z [rank2]:[W1204 13:40:27.882810453 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2952442Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2952696Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2952860Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2953226Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2953430Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2953535Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2953641Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2953738Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2953740Z 2025-12-04T13:44:26.2953983Z [rank2]:[W1204 13:40:27.884265111 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2954169Z [rank3]:[W1204 13:40:27.889471077 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2954343Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2954608Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2954771Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2955139Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2955341Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2955444Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2955539Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2955635Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2955636Z 2025-12-04T13:44:26.2955872Z [rank3]:[W1204 13:40:27.891820555 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2956043Z [rank1]:[W1204 13:40:28.531205721 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2956217Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2956469Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2956632Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2956998Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2957199Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2957304Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2957398Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2957545Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2957563Z 2025-12-04T13:44:26.2957795Z [rank1]:[W1204 13:40:28.533080390 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2957978Z [rank2]:[W1204 13:40:28.884393329 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2958167Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2958435Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2958600Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2958970Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2959176Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2959280Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2959376Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2959473Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2959476Z 2025-12-04T13:44:26.2959709Z [rank2]:[W1204 13:40:28.886379566 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2959880Z [rank3]:[W1204 13:40:28.891942974 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2960054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2960314Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2960475Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2960845Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2961049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2961152Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2961247Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2961343Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2961345Z 2025-12-04T13:44:26.2961581Z [rank3]:[W1204 13:40:28.894236383 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2961761Z [rank1]:[W1204 13:40:29.533233808 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2961964Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2962219Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2962397Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2962768Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2962970Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2963076Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2963170Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2963267Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2963269Z 2025-12-04T13:44:26.2963502Z [rank1]:[W1204 13:40:29.534478620 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2963674Z [rank2]:[W1204 13:40:29.886503354 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2963849Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2964106Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2964271Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2964641Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2964846Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2964951Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2965046Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2965146Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2965148Z 2025-12-04T13:44:26.2965381Z [rank2]:[W1204 13:40:29.888268216 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2965563Z [rank3]:[W1204 13:40:29.894379352 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2965747Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2966016Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2966188Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2966558Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2966761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2966867Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2966963Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2967060Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2967063Z 2025-12-04T13:44:26.2967298Z [rank3]:[W1204 13:40:29.896738490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2967469Z [rank1]:[W1204 13:40:30.534643998 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2967687Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2967944Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2968106Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2968472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2968673Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2968778Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2968874Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2968971Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2968973Z 2025-12-04T13:44:26.2969207Z [rank1]:[W1204 13:40:30.537151673 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2969376Z [rank2]:[W1204 13:40:30.888387305 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2969568Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2969835Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2970014Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2970393Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2970597Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2970700Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2970796Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2970893Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2970894Z 2025-12-04T13:44:26.2971127Z [rank2]:[W1204 13:40:30.890541968 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2971299Z [rank3]:[W1204 13:40:30.896846069 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2971473Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2971730Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2971891Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2972260Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2972462Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2972566Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2972662Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2972760Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2972762Z 2025-12-04T13:44:26.2972995Z [rank3]:[W1204 13:40:30.899077860 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2973165Z [rank1]:[W1204 13:40:31.537300982 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2973353Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2973611Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2973795Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2974174Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2974376Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2974482Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2974576Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2974673Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2974676Z 2025-12-04T13:44:26.2974909Z [rank1]:[W1204 13:40:31.540074531 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2975078Z [rank2]:[W1204 13:40:31.890673067 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2975252Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2975509Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2975674Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2976041Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2976243Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2976348Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2976442Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2976540Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2976542Z 2025-12-04T13:44:26.2976775Z [rank2]:[W1204 13:40:31.891888670 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2976946Z [rank3]:[W1204 13:40:31.899189890 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2977119Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2977386Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2977689Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2978081Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2978296Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2978401Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2978498Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2978592Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2978594Z 2025-12-04T13:44:26.2978829Z [rank3]:[W1204 13:40:31.900412343 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2978998Z [rank1]:[W1204 13:40:32.540226850 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2979174Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2979430Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2979594Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2979962Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2980165Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2980269Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2980364Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2980460Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2980462Z 2025-12-04T13:44:26.2980696Z [rank1]:[W1204 13:40:32.542441921 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2980865Z [rank2]:[W1204 13:40:32.892047028 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2981041Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2981296Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2981472Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2981850Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2982074Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2982181Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2982276Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2982375Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2982377Z 2025-12-04T13:44:26.2982609Z [rank2]:[W1204 13:40:32.893616024 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2982780Z [rank3]:[W1204 13:40:32.900550812 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2982953Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2983210Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2983374Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2983743Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2983945Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2984051Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2984146Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2984243Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2984244Z 2025-12-04T13:44:26.2984480Z [rank3]:[W1204 13:40:32.901789595 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2984651Z [rank1]:[W1204 13:40:33.542572330 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2984826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2985080Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2985252Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2985628Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2985839Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2985954Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2986049Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2986145Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2986147Z 2025-12-04T13:44:26.2986380Z [rank1]:[W1204 13:40:33.544930128 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2986551Z [rank2]:[W1204 13:40:33.893729943 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2986729Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2986985Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2987148Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2987565Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2987769Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2987873Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2987969Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2988066Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2988069Z 2025-12-04T13:44:26.2988301Z [rank2]:[W1204 13:40:33.895311068 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2988472Z [rank3]:[W1204 13:40:33.901887874 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2988646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2988910Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2989072Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2989455Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2989686Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2989789Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2989884Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2989991Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2989993Z 2025-12-04T13:44:26.2990227Z [rank3]:[W1204 13:40:33.903025679 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2990397Z [rank1]:[W1204 13:40:34.545106576 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2990572Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.2990828Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2990990Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2991356Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2991559Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2991664Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2991759Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2991857Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2991859Z 2025-12-04T13:44:26.2992094Z [rank1]:[W1204 13:40:34.547483674 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2992265Z [rank2]:[W1204 13:40:34.895463847 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2992441Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2992697Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2992862Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2993236Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2993467Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2993583Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2993677Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2993773Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2993785Z 2025-12-04T13:44:26.2994018Z [rank2]:[W1204 13:40:34.896758909 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2994189Z [rank3]:[W1204 13:40:34.903160848 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2994364Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2994622Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2994783Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2995151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2995355Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2995460Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2995557Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2995652Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2995654Z 2025-12-04T13:44:26.2995888Z [rank3]:[W1204 13:40:34.904339272 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2996058Z [rank1]:[W1204 13:40:35.547633243 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2996231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2996486Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2996649Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2997015Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2997226Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2997342Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2997449Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2997582Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2997585Z 2025-12-04T13:44:26.2997832Z [rank1]:[W1204 13:40:35.550010701 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2998001Z [rank2]:[W1204 13:40:35.896889518 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.2998176Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.2998430Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.2998594Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2998961Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.2999164Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.2999268Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.2999365Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2999462Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.2999464Z 2025-12-04T13:44:26.2999700Z [rank2]:[W1204 13:40:35.898609750 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.2999872Z [rank3]:[W1204 13:40:35.904440192 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3000045Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3000302Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3000467Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3000834Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3001049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3001152Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3001264Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3001372Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3001373Z 2025-12-04T13:44:26.3001607Z [rank3]:[W1204 13:40:35.907224941 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3001788Z [rank1]:[W1204 13:40:36.550157180 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3001965Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3002221Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3002385Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3002755Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3002956Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3003062Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3003158Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3003254Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3003257Z 2025-12-04T13:44:26.3003490Z [rank1]:[W1204 13:40:36.552710294 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3003660Z [rank2]:[W1204 13:40:36.898755899 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3003834Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3004097Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3004261Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3004631Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3004833Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3004948Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3005043Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3005139Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3005162Z 2025-12-04T13:44:26.3005394Z [rank2]:[W1204 13:40:36.900363594 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3005574Z [rank3]:[W1204 13:40:36.907344901 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3005748Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3006006Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3006170Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3006540Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3006741Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3006845Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3006940Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3007035Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3007037Z 2025-12-04T13:44:26.3007271Z [rank3]:[W1204 13:40:36.908607243 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3007443Z [rank1]:[W1204 13:40:37.552859232 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3007660Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3007915Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3008076Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3008446Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3008650Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3008755Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3008875Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3008970Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3008972Z 2025-12-04T13:44:26.3009218Z [rank1]:[W1204 13:40:37.555177702 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3009399Z [rank2]:[W1204 13:40:37.900485873 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3009589Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3009843Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3010008Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3010379Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3010583Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3010688Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3010783Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3010881Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3010883Z 2025-12-04T13:44:26.3011116Z [rank2]:[W1204 13:40:37.902656456 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3011286Z [rank3]:[W1204 13:40:37.908703123 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3011459Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3011716Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3011879Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3012250Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3012455Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3012559Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3012655Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3012762Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3012765Z 2025-12-04T13:44:26.3013002Z [rank3]:[W1204 13:40:37.909812149 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3013194Z [rank1]:[W1204 13:40:38.555370400 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3013368Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3013635Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3013797Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3014163Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3014365Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3014470Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3014567Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3014661Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3014664Z 2025-12-04T13:44:26.3014900Z [rank1]:[W1204 13:40:38.557800956 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3015072Z [rank2]:[W1204 13:40:38.903831782 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3015247Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3015501Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3015664Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3016032Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3016233Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3016338Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3016432Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3016529Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3016542Z 2025-12-04T13:44:26.3016774Z [rank2]:[W1204 13:40:38.905821738 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3016956Z [rank3]:[W1204 13:40:38.909933978 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3017143Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3017414Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3017608Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3017979Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3018183Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3018286Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3018381Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3018478Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3018480Z 2025-12-04T13:44:26.3018714Z [rank3]:[W1204 13:40:38.911118892 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3018887Z [rank1]:[W1204 13:40:39.557933926 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3019062Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3019317Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3019480Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3019847Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3020049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3020155Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3020251Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3020347Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3020349Z 2025-12-04T13:44:26.3020582Z [rank1]:[W1204 13:40:39.560264845 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3020770Z [rank2]:[W1204 13:40:39.905961838 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3020957Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3021229Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3021411Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3021783Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3021986Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3022091Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3022186Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3022283Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3022285Z 2025-12-04T13:44:26.3022518Z [rank2]:[W1204 13:40:39.907235870 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3022692Z [rank3]:[W1204 13:40:39.911243412 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3022866Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3023125Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3023288Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3023654Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3023860Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3023965Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3024064Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3024159Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3024162Z 2025-12-04T13:44:26.3024395Z [rank3]:[W1204 13:40:39.912398187 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3024577Z [rank1]:[W1204 13:40:40.560719047 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3024751Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3025019Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3025193Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3025570Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3025774Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3025878Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3025976Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3026071Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3026072Z 2025-12-04T13:44:26.3026307Z [rank1]:[W1204 13:40:40.562825611 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3026477Z [rank2]:[W1204 13:40:40.907363249 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3026654Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3026911Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3027075Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3027444Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3027685Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3027791Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3027886Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3027983Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3027985Z 2025-12-04T13:44:26.3028220Z [rank2]:[W1204 13:40:40.908588603 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3028392Z [rank3]:[W1204 13:40:40.912531186 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3028580Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3028848Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3029028Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3029409Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3029612Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3029720Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3029815Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3029912Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3029915Z 2025-12-04T13:44:26.3030147Z [rank3]:[W1204 13:40:40.913780449 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3030318Z [rank1]:[W1204 13:40:41.562981770 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3030494Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3030752Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3030915Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3031282Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3031484Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3031593Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3031688Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3031784Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3031787Z 2025-12-04T13:44:26.3032020Z [rank1]:[W1204 13:40:41.564724642 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3032190Z [rank2]:[W1204 13:40:41.908688803 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3032365Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3032630Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3032805Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3033195Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3033399Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3033503Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3033598Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3033694Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3033698Z 2025-12-04T13:44:26.3033929Z [rank2]:[W1204 13:40:41.910281958 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3034101Z [rank3]:[W1204 13:40:41.913906509 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3034276Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3034531Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3034695Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3035064Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3035267Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3035371Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3035468Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3035563Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3035566Z 2025-12-04T13:44:26.3035802Z [rank3]:[W1204 13:40:41.915424815 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3035973Z [rank1]:[W1204 13:40:42.564905670 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3036147Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3036402Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3036574Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3036950Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3037178Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3037283Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3037380Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3037513Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3037515Z 2025-12-04T13:44:26.3037753Z [rank1]:[W1204 13:40:42.567055163 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3037925Z [rank2]:[W1204 13:40:42.910375629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3038100Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3038356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3038523Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3038890Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3039092Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3039198Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3039292Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3039392Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3039394Z 2025-12-04T13:44:26.3039627Z [rank2]:[W1204 13:40:42.911792198 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3039798Z [rank3]:[W1204 13:40:42.915522606 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3039976Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3040230Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3040407Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3040785Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3040999Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3041116Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3041214Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3041312Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3041313Z 2025-12-04T13:44:26.3041544Z [rank3]:[W1204 13:40:42.916660231 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3041717Z [rank1]:[W1204 13:40:43.567235712 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3041891Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3042150Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3042312Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3042679Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3042881Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3042984Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3043080Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3043175Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3043177Z 2025-12-04T13:44:26.3043410Z [rank1]:[W1204 13:40:43.569678859 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3043579Z [rank2]:[W1204 13:40:43.911931967 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3043756Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3044009Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3044172Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3044565Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3044778Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3044884Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3044989Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3045086Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3045089Z 2025-12-04T13:44:26.3045322Z [rank2]:[W1204 13:40:43.913904994 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3045492Z [rank3]:[W1204 13:40:43.916759291 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3045667Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3045922Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3046084Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3046455Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3046661Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3046763Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3046859Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3046955Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3046957Z 2025-12-04T13:44:26.3047189Z [rank3]:[W1204 13:40:43.917870697 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3047363Z [rank1]:[W1204 13:40:44.569824258 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3047579Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3047834Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3047996Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3048363Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3048606Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3048724Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3048820Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3048929Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3048931Z 2025-12-04T13:44:26.3049167Z [rank1]:[W1204 13:40:44.572495060 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3049339Z [rank2]:[W1204 13:40:44.914070473 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3049514Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3049773Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3049936Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3050304Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3050510Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3050616Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3050711Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3050808Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3050809Z 2025-12-04T13:44:26.3051044Z [rank2]:[W1204 13:40:44.916042680 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3051217Z [rank3]:[W1204 13:40:44.917994907 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3051391Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3051647Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3051810Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3052176Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3052388Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3052508Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3052615Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3052711Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3052713Z 2025-12-04T13:44:26.3052959Z [rank3]:[W1204 13:40:44.919169572 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3053132Z [rank1]:[W1204 13:40:45.572680288 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3053306Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3053561Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3053724Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3054097Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3054299Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3054402Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3054499Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3054594Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3054595Z 2025-12-04T13:44:26.3054828Z [rank1]:[W1204 13:40:45.574576907 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3054998Z [rank2]:[W1204 13:40:45.916183750 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3055174Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3055430Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3055593Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3055962Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3056176Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3056280Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3056385Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3056494Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3056495Z 2025-12-04T13:44:26.3056736Z [rank2]:[W1204 13:40:45.917442852 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3056908Z [rank3]:[W1204 13:40:45.919305231 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3057085Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3057345Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3057552Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3057919Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3058121Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3058227Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3058321Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3058419Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3058421Z 2025-12-04T13:44:26.3058653Z [rank3]:[W1204 13:40:45.920765619 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3058824Z [rank1]:[W1204 13:40:46.574761246 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3058997Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3059253Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3059417Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3059789Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3059991Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3060109Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3060205Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3060314Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3060330Z 2025-12-04T13:44:26.3060563Z [rank1]:[W1204 13:40:46.576665764 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3060745Z [rank2]:[W1204 13:40:46.917594792 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3060922Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3061178Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3061340Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3061713Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3061918Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3062024Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3062120Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3062217Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3062220Z 2025-12-04T13:44:26.3062453Z [rank2]:[W1204 13:40:46.918907943 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3062624Z [rank3]:[W1204 13:40:46.920911639 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3062801Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3063056Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3063219Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3063584Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3063788Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3063892Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3063998Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3064095Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3064097Z 2025-12-04T13:44:26.3064343Z [rank3]:[W1204 13:40:46.922799288 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3064525Z [rank1]:[W1204 13:40:47.576792664 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3064710Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3064965Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3065126Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3065494Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3065698Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3065802Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3065898Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3065992Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3065994Z 2025-12-04T13:44:26.3066229Z [rank1]:[W1204 13:40:47.579205431 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3066400Z [rank2]:[W1204 13:40:47.919078783 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3066576Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3066832Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3066994Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3067362Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3067605Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3067711Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3067806Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3067924Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3067926Z 2025-12-04T13:44:26.3068173Z [rank2]:[W1204 13:40:47.921114488 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3068357Z [rank3]:[W1204 13:40:47.922971527 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3068532Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3068799Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3068963Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3069328Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3069530Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3069635Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3069733Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3069830Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3069832Z 2025-12-04T13:44:26.3070064Z [rank3]:[W1204 13:40:47.924278459 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3070235Z [rank1]:[W1204 13:40:48.579335572 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3070409Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3070671Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3070833Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3071203Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3071405Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3071507Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3071603Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3071698Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3071710Z 2025-12-04T13:44:26.3071943Z [rank1]:[W1204 13:40:48.581510104 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3072124Z [rank2]:[W1204 13:40:48.921261628 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3072315Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3072582Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3072746Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3073116Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3073318Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3073422Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3073518Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3073615Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3073617Z 2025-12-04T13:44:26.3073852Z [rank2]:[W1204 13:40:48.922520170 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3074022Z [rank3]:[W1204 13:40:48.924404269 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3074196Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3074451Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3074615Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3074987Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3075191Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3075298Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3075392Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3075489Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3075491Z 2025-12-04T13:44:26.3075723Z [rank3]:[W1204 13:40:48.926123211 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3075904Z [rank1]:[W1204 13:40:49.581698623 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3076090Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3076356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3076527Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3076896Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3077101Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3077206Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3077303Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3077399Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3077400Z 2025-12-04T13:44:26.3077667Z [rank1]:[W1204 13:40:49.583795887 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3077838Z [rank2]:[W1204 13:40:49.922718279 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3078014Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3078270Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3078434Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3078803Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3079007Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3079112Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3079206Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3079305Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3079306Z 2025-12-04T13:44:26.3079542Z [rank2]:[W1204 13:40:49.924720725 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3079726Z [rank3]:[W1204 13:40:49.926321060 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3079901Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3080185Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3080349Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3080737Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3080941Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3081046Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3081142Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3081238Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3081240Z 2025-12-04T13:44:26.3081475Z [rank3]:[W1204 13:40:49.928683289 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3081646Z [rank1]:[W1204 13:40:50.583941677 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3081818Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3082073Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3082236Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3082601Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3082804Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3082908Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3083005Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3083100Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3083102Z 2025-12-04T13:44:26.3083339Z [rank1]:[W1204 13:40:50.586336635 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3083509Z [rank2]:[W1204 13:40:50.924850306 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3083697Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3083970Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3084146Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3084523Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3084725Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3084829Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3084925Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3085024Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3085025Z 2025-12-04T13:44:26.3085259Z [rank2]:[W1204 13:40:50.926928110 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3085429Z [rank3]:[W1204 13:40:50.928831329 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3085603Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3085862Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3086026Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3086394Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3086596Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3086701Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3086795Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3086893Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3086895Z 2025-12-04T13:44:26.3087127Z [rank3]:[W1204 13:40:50.931083609 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3087298Z [rank1]:[W1204 13:40:51.586463356 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3087504Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3087773Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3087948Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3088343Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3088546Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3088651Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3088746Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3088842Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3088846Z 2025-12-04T13:44:26.3089079Z [rank1]:[W1204 13:40:51.588824724 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3089251Z [rank2]:[W1204 13:40:51.927078801 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3089425Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3089681Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3089844Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3090216Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3090419Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3090525Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3090620Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3090717Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3090721Z 2025-12-04T13:44:26.3090954Z [rank2]:[W1204 13:40:51.928290294 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3091123Z [rank3]:[W1204 13:40:51.931202190 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3091299Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3091553Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3091729Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3092118Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3092332Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3092438Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3092534Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3092631Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3092633Z 2025-12-04T13:44:26.3092864Z [rank3]:[W1204 13:40:51.932367895 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3093035Z [rank1]:[W1204 13:40:52.588967994 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3093209Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3093465Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3093629Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3093997Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3094200Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3094303Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3094399Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3094496Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3094498Z 2025-12-04T13:44:26.3094735Z [rank1]:[W1204 13:40:52.591387041 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3094907Z [rank2]:[W1204 13:40:52.928617490 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3095082Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3095340Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3095516Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3095896Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3096115Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3096229Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3096326Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3096422Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3096424Z 2025-12-04T13:44:26.3096661Z [rank2]:[W1204 13:40:52.931011458 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3096833Z [rank3]:[W1204 13:40:52.932542984 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3097007Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3097263Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3097426Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3097808Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3098011Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3098116Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3098211Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3098307Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3098310Z 2025-12-04T13:44:26.3098542Z [rank3]:[W1204 13:40:52.934193688 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3098714Z [rank1]:[W1204 13:40:53.591573131 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3098889Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3099148Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3099311Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3099708Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3099922Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3100025Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3100133Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3100230Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3100233Z 2025-12-04T13:44:26.3100467Z [rank1]:[W1204 13:40:53.594020657 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3100637Z [rank2]:[W1204 13:40:53.931158728 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3100811Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3101067Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3101233Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3101604Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3101806Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3101910Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3102006Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3102102Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3102103Z 2025-12-04T13:44:26.3102337Z [rank2]:[W1204 13:40:53.933537846 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3102507Z [rank3]:[W1204 13:40:53.934327719 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3102683Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3102938Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3103102Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3103486Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3103697Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3103812Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3103906Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3104015Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3104017Z 2025-12-04T13:44:26.3104249Z [rank3]:[W1204 13:40:53.936637378 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3104420Z [rank1]:[W1204 13:40:54.594176438 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3104595Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3104852Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3105016Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3105385Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3105590Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3105695Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3105790Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3105885Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3105888Z 2025-12-04T13:44:26.3106120Z [rank1]:[W1204 13:40:54.595674965 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3106292Z [rank2]:[W1204 13:40:54.933657847 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3106466Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3106722Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3106884Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3107252Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3107465Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3107621Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3107718Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3107814Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3107816Z 2025-12-04T13:44:26.3108062Z [rank2]:[W1204 13:40:54.935921788 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3108235Z [rank3]:[W1204 13:40:54.936733890 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3108411Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3108668Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3108830Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3109197Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3109398Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3109503Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3109599Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3109695Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3109697Z 2025-12-04T13:44:26.3109932Z [rank3]:[W1204 13:40:55.939102608 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3110103Z [rank1]:[W1204 13:40:55.595814326 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3110277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3110534Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3110697Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3111063Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3111278Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3111381Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3111506Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3111602Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3111604Z 2025-12-04T13:44:26.3111848Z [rank1]:[W1204 13:40:55.598275081 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3112021Z [rank2]:[W1204 13:40:55.936057879 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3112197Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3112454Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3112618Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3112987Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3113189Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3113294Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3113390Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3113487Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3113489Z 2025-12-04T13:44:26.3113723Z [rank2]:[W1204 13:40:56.937280622 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3113894Z [rank3]:[W1204 13:40:56.939232589 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3114068Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3114328Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3114492Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3114859Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3115059Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3115175Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3115270Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3115389Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3115391Z 2025-12-04T13:44:26.3115622Z [rank3]:[W1204 13:40:56.941686395 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3115803Z [rank1]:[W1204 13:40:56.598438682 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3115977Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3116234Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3116398Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3116770Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3116973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3117077Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3117173Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3117269Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3117272Z 2025-12-04T13:44:26.3117551Z [rank1]:[W1204 13:40:56.600161964 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3117723Z [rank2]:[W1204 13:40:57.937430322 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3117896Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3118152Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3118315Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3118689Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3118893Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3119013Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3119109Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3119204Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3119207Z 2025-12-04T13:44:26.3119467Z [rank2]:[W1204 13:40:57.939360100 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3119637Z [rank3]:[W1204 13:40:57.941829446 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3119823Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3120080Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3120243Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3120613Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3120816Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3120920Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3121015Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3121112Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3121113Z 2025-12-04T13:44:26.3121347Z [rank3]:[W1204 13:40:57.943884171 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3121520Z [rank1]:[W1204 13:40:57.600307355 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3121695Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3121950Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3122117Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3122484Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3122687Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3122790Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3122896Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3122994Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3122997Z 2025-12-04T13:44:26.3123242Z [rank1]:[W1204 13:40:57.602520666 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3123423Z [rank2]:[W1204 13:40:58.939504881 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3123597Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3123880Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3124043Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3124412Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3124616Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3124719Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3124816Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3124915Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3124916Z 2025-12-04T13:44:26.3125154Z [rank2]:[W1204 13:40:58.941175434 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3125325Z [rank3]:[W1204 13:40:58.944030562 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3125500Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3125755Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3125918Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3126284Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3126485Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3126590Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3126684Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3126780Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3126804Z 2025-12-04T13:44:26.3127037Z [rank3]:[W1204 13:40:58.946101626 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3127220Z [rank1]:[W1204 13:40:58.602677577 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3127406Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3127723Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3127887Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3128256Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3128459Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3128562Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3128659Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3128754Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3128758Z 2025-12-04T13:44:26.3128992Z [rank1]:[W1204 13:40:58.605251011 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3129164Z [rank2]:[W1204 13:40:59.941315426 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3129339Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3129598Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3129761Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3130129Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3130333Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3130436Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3130535Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3130630Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3130632Z 2025-12-04T13:44:26.3130878Z [rank2]:[W1204 13:40:59.942541279 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3131047Z [rank3]:[W1204 13:40:59.946234128 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3131249Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3131505Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3131681Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3132052Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3132254Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3132360Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3132454Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3132551Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3132553Z 2025-12-04T13:44:26.3132785Z [rank3]:[W1204 13:40:59.948518368 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3132957Z [rank1]:[W1204 13:40:59.605399582 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3133135Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3133391Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3133555Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3133926Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3134130Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3134235Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3134330Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3134426Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3134430Z 2025-12-04T13:44:26.3134662Z [rank1]:[W1204 13:40:59.607768010 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3134846Z [rank2]:[W1204 13:41:00.942671190 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3135031Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3135299Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3135472Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3135842Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3136049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3136155Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3136251Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3136346Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3136348Z 2025-12-04T13:44:26.3136586Z [rank2]:[W1204 13:41:00.944134468 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3136756Z [rank3]:[W1204 13:41:00.948646649 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3136929Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3137185Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3137347Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3137762Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3137962Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3138067Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3138163Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3138259Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3138261Z 2025-12-04T13:44:26.3138493Z [rank3]:[W1204 13:41:00.951674233 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3138666Z [rank1]:[W1204 13:41:00.607890341 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3138854Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3139123Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3139297Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3139684Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3139887Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3139991Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3140088Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3140185Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3140187Z 2025-12-04T13:44:26.3140421Z [rank1]:[W1204 13:41:00.609500006 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3140592Z [rank2]:[W1204 13:41:01.944264939 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3140767Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3141023Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3141187Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3141557Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3141761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3141864Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3141961Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3142057Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3142059Z 2025-12-04T13:44:26.3142291Z [rank2]:[W1204 13:41:01.946390563 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3142461Z [rank3]:[W1204 13:41:01.951801294 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3142647Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3142904Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3143090Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3143467Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3143668Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3143773Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3143867Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3143964Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3143966Z 2025-12-04T13:44:26.3144199Z [rank3]:[W1204 13:41:01.954127753 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3144369Z [rank1]:[W1204 13:41:01.609670997 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3148363Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3148634Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3148802Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3149174Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3149377Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3149481Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3149577Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3149674Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3149677Z 2025-12-04T13:44:26.3149909Z [rank1]:[W1204 13:41:01.612117923 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3150080Z [rank2]:[W1204 13:41:02.946533654 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3150254Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3150539Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3150715Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3151103Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3151318Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3151424Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3151519Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3151615Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3151617Z 2025-12-04T13:44:26.3151855Z [rank2]:[W1204 13:41:02.948687577 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3152025Z [rank3]:[W1204 13:41:02.954250035 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3152205Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3152461Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3152625Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3152998Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3153201Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3153306Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3153400Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3153497Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3153499Z 2025-12-04T13:44:26.3153732Z [rank3]:[W1204 13:41:02.956596853 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3153905Z [rank1]:[W1204 13:41:02.612299633 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3154080Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3154334Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3154507Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3154887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3155110Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3155216Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3155312Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3155409Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3155411Z 2025-12-04T13:44:26.3155645Z [rank1]:[W1204 13:41:02.614273980 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3155817Z [rank2]:[W1204 13:41:03.948868367 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3155990Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3156245Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3156408Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3156776Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3156980Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3157084Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3157182Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3157278Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3157280Z 2025-12-04T13:44:26.3157557Z [rank2]:[W1204 13:41:03.951088489 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3157728Z [rank3]:[W1204 13:41:03.956759474 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3157904Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3158159Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3158339Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3158717Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3158936Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3159054Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3159150Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3159247Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3159250Z 2025-12-04T13:44:26.3159480Z [rank3]:[W1204 13:41:03.958911087 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3159654Z [rank1]:[W1204 13:41:03.614486370 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3159830Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3160083Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3160245Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3160611Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3160814Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3160917Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3161013Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3161111Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3161114Z 2025-12-04T13:44:26.3161346Z [rank1]:[W1204 13:41:03.616770590 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3161517Z [rank2]:[W1204 13:41:04.951276059 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3161690Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3161948Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3162110Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3162488Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3162714Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3162818Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3162913Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3163019Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3163021Z 2025-12-04T13:44:26.3163255Z [rank2]:[W1204 13:41:04.953056080 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3163425Z [rank3]:[W1204 13:41:04.959050039 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3163601Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3163857Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3164020Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3164388Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3164590Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3164694Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3164788Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3164884Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3164886Z 2025-12-04T13:44:26.3165119Z [rank3]:[W1204 13:41:04.961350168 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3165290Z [rank1]:[W1204 13:41:04.616943191 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3165465Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3165720Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3165883Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3166257Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3166479Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3166593Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3166687Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3166784Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3166796Z 2025-12-04T13:44:26.3167029Z [rank1]:[W1204 13:41:04.619086924 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3167201Z [rank2]:[W1204 13:41:05.953234151 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3167375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3167672Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3167835Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3168203Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3168407Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3168512Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3168608Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3168704Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3168706Z 2025-12-04T13:44:26.3168939Z [rank2]:[W1204 13:41:05.955178998 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3169109Z [rank3]:[W1204 13:41:05.961513829 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3169283Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3170772Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3170939Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3171936Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3172156Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3172262Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3172374Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3172484Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3172486Z 2025-12-04T13:44:26.3172720Z [rank3]:[W1204 13:41:05.963880348 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3172895Z [rank1]:[W1204 13:41:05.619277574 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3173074Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3173330Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3173496Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3173863Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3174067Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3174171Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3174267Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3174363Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3174366Z 2025-12-04T13:44:26.3174597Z [rank1]:[W1204 13:41:05.621616163 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3174767Z [rank2]:[W1204 13:41:06.955341669 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3174944Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3175201Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3175423Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3175790Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3176027Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3176130Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3176226Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3176336Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3176339Z 2025-12-04T13:44:26.3176571Z [rank2]:[W1204 13:41:06.957454033 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3176742Z [rank3]:[W1204 13:41:06.964024339 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3176917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3177172Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3177337Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3177753Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3177955Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3178061Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3178156Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3178251Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3178254Z 2025-12-04T13:44:26.3178488Z [rank3]:[W1204 13:41:06.966340788 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3178659Z [rank1]:[W1204 13:41:06.621751175 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3178834Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3179090Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3179253Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3179645Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3179848Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3179984Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3180079Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3180174Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3180189Z 2025-12-04T13:44:26.3180422Z [rank1]:[W1204 13:41:06.623701102 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3180593Z [rank2]:[W1204 13:41:07.957602925 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3180768Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3181023Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3181186Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3181555Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3181761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3181866Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3181966Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3182063Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3182065Z 2025-12-04T13:44:26.3182298Z [rank2]:[W1204 13:41:07.959103602 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3182470Z [rank3]:[W1204 13:41:07.966464990 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3182642Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3182898Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3183060Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3183438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3183642Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3183745Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3183863Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3183960Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3183962Z 2025-12-04T13:44:26.3184196Z [rank3]:[W1204 13:41:07.968805609 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3184377Z [rank1]:[W1204 13:41:07.623870213 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3184552Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3184808Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3184971Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3185339Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3185540Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3185643Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3185737Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3185835Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3185836Z 2025-12-04T13:44:26.3186070Z [rank1]:[W1204 13:41:07.626076425 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3186244Z [rank2]:[W1204 13:41:08.959264653 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3186419Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3186675Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3186839Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3187206Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3187420Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3187566Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3187661Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3187789Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3187791Z 2025-12-04T13:44:26.3188024Z [rank2]:[W1204 13:41:08.961171671 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3188209Z [rank3]:[W1204 13:41:08.968928911 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3188384Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3188641Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3188805Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3189172Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3189375Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3189478Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3189573Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3189668Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3189671Z 2025-12-04T13:44:26.3189905Z [rank3]:[W1204 13:41:08.971318449 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3190078Z [rank1]:[W1204 13:41:08.626245416 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3190255Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3190513Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3190677Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3191046Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3191267Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3191371Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3191465Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3191562Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3191574Z 2025-12-04T13:44:26.3191817Z [rank1]:[W1204 13:41:08.627995208 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3191989Z [rank2]:[W1204 13:41:09.961350183 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3192176Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3192430Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3192592Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3192961Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3193168Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3193272Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3193366Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3193463Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3193465Z 2025-12-04T13:44:26.3193698Z [rank2]:[W1204 13:41:09.963613373 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3193868Z [rank3]:[W1204 13:41:09.971482961 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3194042Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3194297Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3194458Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3194827Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3195032Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3195146Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3195242Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3195338Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3195339Z 2025-12-04T13:44:26.3195584Z [rank3]:[W1204 13:41:09.973905237 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3195764Z [rank1]:[W1204 13:41:09.628183069 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3195938Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3196207Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3196368Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3196734Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3196936Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3197041Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3197137Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3197234Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3197236Z 2025-12-04T13:44:26.3197469Z [rank1]:[W1204 13:41:09.630060438 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3197683Z [rank2]:[W1204 13:41:10.963766585 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3197858Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3198114Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3198277Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3198646Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3198849Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3198954Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3199065Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3199164Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3199165Z 2025-12-04T13:44:26.3199401Z [rank2]:[W1204 13:41:10.964996248 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3199599Z [rank3]:[W1204 13:41:10.974036830 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3199776Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3200031Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3200207Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3200576Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3200781Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3200883Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3200979Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3201075Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3201077Z 2025-12-04T13:44:26.3201310Z [rank3]:[W1204 13:41:10.975515857 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3201480Z [rank1]:[W1204 13:41:10.630250549 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3201657Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3201911Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3202074Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3202441Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3202643Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3202748Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3202842Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3202939Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3202954Z 2025-12-04T13:44:26.3203187Z [rank1]:[W1204 13:41:10.631938012 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3203356Z [rank2]:[W1204 13:41:11.965145470 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3203555Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3203813Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3203997Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3204363Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3204566Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3204671Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3204765Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3204862Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3204864Z 2025-12-04T13:44:26.3205097Z [rank2]:[W1204 13:41:11.966472081 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3205267Z [rank3]:[W1204 13:41:11.975717408 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3205441Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3205701Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3205865Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3206234Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3206435Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3206539Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3206634Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3206730Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3206733Z 2025-12-04T13:44:26.3206977Z [rank3]:[W1204 13:41:11.977018810 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3207149Z [rank1]:[W1204 13:41:11.632108934 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3207323Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3207643Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3207805Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3208193Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3208394Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3208505Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3208599Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3208696Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3208699Z 2025-12-04T13:44:26.3208934Z [rank1]:[W1204 13:41:11.633372676 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3209104Z [rank2]:[W1204 13:41:12.966622133 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3209278Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3209535Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3209699Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3210067Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3210270Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3210374Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3210470Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3210566Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3210568Z 2025-12-04T13:44:26.3210804Z [rank2]:[W1204 13:41:12.967970433 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3210993Z [rank3]:[W1204 13:41:12.977128502 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3211168Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3211435Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3211614Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3211980Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3212192Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3212296Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3212394Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3212489Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3212491Z 2025-12-04T13:44:26.3212722Z [rank3]:[W1204 13:41:12.978308267 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3212897Z [rank1]:[W1204 13:41:12.633551787 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3213074Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3213331Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3213494Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3213861Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3214065Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3214168Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3214262Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3214360Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3214362Z 2025-12-04T13:44:26.3214597Z [rank1]:[W1204 13:41:12.636190359 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3214779Z [rank2]:[W1204 13:41:13.968122215 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3214955Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3215212Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3215396Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3215765Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3215980Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3216083Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3216178Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3216277Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3216279Z 2025-12-04T13:44:26.3216511Z [rank2]:[W1204 13:41:13.969363368 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3216681Z [rank3]:[W1204 13:41:13.978465638 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3216855Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3217109Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3217273Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3217693Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3217895Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3218000Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3218095Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3218190Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3218193Z 2025-12-04T13:44:26.3218425Z [rank3]:[W1204 13:41:13.979993825 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3218595Z [rank1]:[W1204 13:41:13.636356361 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3218785Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3219041Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3219202Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3219609Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3219825Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3219928Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3220023Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3220118Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3220121Z 2025-12-04T13:44:26.3220355Z [rank1]:[W1204 13:41:13.639113781 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3220525Z [rank2]:[W1204 13:41:14.969517600 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3220702Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3220957Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3221119Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3221489Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3221691Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3221797Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3221892Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3221989Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3221991Z 2025-12-04T13:44:26.3222223Z [rank2]:[W1204 13:41:14.970740313 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3222394Z [rank3]:[W1204 13:41:14.980064129 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3222569Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3222836Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3222999Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3223376Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3223589Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3223706Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3223800Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3223897Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3223898Z 2025-12-04T13:44:26.3224131Z [rank3]:[W1204 13:41:14.981431659 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3224302Z [rank1]:[W1204 13:41:14.639287383 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3224475Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3224731Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3224891Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3225260Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3225463Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3225567Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3225662Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3225757Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3225759Z 2025-12-04T13:44:26.3225996Z [rank1]:[W1204 13:41:14.641601072 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3226169Z [rank2]:[W1204 13:41:15.970881046 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3226343Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3226610Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3226773Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3227154Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3227366Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3227525Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3227634Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3227731Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3227732Z 2025-12-04T13:44:26.3227966Z [rank2]:[W1204 13:41:15.972142298 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3228136Z [rank3]:[W1204 13:41:15.981574921 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3228312Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3228568Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3228733Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3229099Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3229301Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3229405Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3229500Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3229596Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3229598Z 2025-12-04T13:44:26.3229829Z [rank3]:[W1204 13:41:15.982750546 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3230000Z [rank1]:[W1204 13:41:15.641715785 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3230177Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3230433Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3230615Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3230985Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3231216Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3231319Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3231414Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3231520Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3231523Z 2025-12-04T13:44:26.3231757Z [rank1]:[W1204 13:41:15.643147394 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3231926Z [rank2]:[W1204 13:41:16.972421168 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3232103Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3232357Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3232520Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3232892Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3233094Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3233198Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3233293Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3233390Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3233392Z 2025-12-04T13:44:26.3233625Z [rank2]:[W1204 13:41:16.973884536 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3233793Z [rank3]:[W1204 13:41:16.982898968 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3233967Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3234224Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3234388Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3234767Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3234973Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3235106Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3235201Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3235296Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3235309Z 2025-12-04T13:44:26.3235542Z [rank3]:[W1204 13:41:16.984312277 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3235712Z [rank1]:[W1204 13:41:16.643332255 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3235885Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3236141Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3236303Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3236668Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3236869Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3236976Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3237071Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3237167Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3237170Z 2025-12-04T13:44:26.3237402Z [rank1]:[W1204 13:41:16.645377000 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3237616Z [rank2]:[W1204 13:41:17.974040928 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3237789Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3238046Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3238207Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3238600Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3238801Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3238905Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3239031Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3239128Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3239130Z 2025-12-04T13:44:26.3239365Z [rank2]:[W1204 13:41:17.976023254 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3239549Z [rank3]:[W1204 13:41:17.984436260 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3239723Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3239979Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3240143Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3240509Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3240711Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3240814Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3240911Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3241006Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3241008Z 2025-12-04T13:44:26.3241239Z [rank3]:[W1204 13:41:17.985593395 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3241412Z [rank1]:[W1204 13:41:17.645583692 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3241585Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3241840Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3242004Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3242380Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3242582Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3242685Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3242782Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3242898Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3242901Z 2025-12-04T13:44:26.3243134Z [rank1]:[W1204 13:41:17.647759974 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3243317Z [rank2]:[W1204 13:41:18.976205696 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3243490Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3243749Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3243913Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3244279Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3244481Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3244586Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3244681Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3244778Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3244780Z 2025-12-04T13:44:26.3245014Z [rank2]:[W1204 13:41:18.977941458 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3245185Z [rank3]:[W1204 13:41:18.985706038 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3245361Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3245615Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3245784Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3246155Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3246367Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3246472Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3246567Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3246663Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3246677Z 2025-12-04T13:44:26.3246920Z [rank3]:[W1204 13:41:18.987381501 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3247091Z [rank1]:[W1204 13:41:18.647924216 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3247277Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3247573Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3247736Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3248107Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3248311Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3248415Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3248509Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3248604Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3248605Z 2025-12-04T13:44:26.3248840Z [rank1]:[W1204 13:41:18.650292104 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3249009Z [rank2]:[W1204 13:41:19.978091951 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3249183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3249439Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3249600Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3249971Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3250174Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3250299Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3250396Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3250491Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3250493Z 2025-12-04T13:44:26.3250741Z [rank2]:[W1204 13:41:19.979334264 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3250924Z [rank3]:[W1204 13:41:19.987530344 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3251099Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3251367Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3251529Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3251897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3252100Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3252208Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3252302Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3252400Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3252401Z 2025-12-04T13:44:26.3252637Z [rank3]:[W1204 13:41:19.989845543 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3252810Z [rank1]:[W1204 13:41:19.650460846 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3252984Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3253241Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3253406Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3253775Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3253978Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3254082Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3254189Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3254285Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3254287Z 2025-12-04T13:44:26.3254522Z [rank1]:[W1204 13:41:19.652468582 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3255908Z [rank2]:[W1204 13:41:20.979478256 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3256085Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3256359Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3256521Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3256888Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3257092Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3257196Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3257293Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3257390Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3257392Z 2025-12-04T13:44:26.3257705Z [rank2]:[W1204 13:41:20.982125538 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3257877Z [rank3]:[W1204 13:41:20.989988266 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3258053Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3258309Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3258474Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3258841Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3259045Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3259149Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3259245Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3259358Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3259360Z 2025-12-04T13:44:26.3259592Z [rank3]:[W1204 13:41:20.991640480 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3259761Z [rank1]:[W1204 13:41:20.652649595 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3259963Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3260222Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3260402Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3260768Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3260972Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3261074Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3261169Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3261265Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3261268Z 2025-12-04T13:44:26.3261501Z [rank1]:[W1204 13:41:20.654960224 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3261671Z [rank2]:[W1204 13:41:21.982260442 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3261847Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3262102Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3262265Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3262635Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3262838Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3262943Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3263038Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3263135Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3263137Z 2025-12-04T13:44:26.3263380Z [rank2]:[W1204 13:41:21.983743149 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3263550Z [rank3]:[W1204 13:41:21.991777863 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3263724Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3264001Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3264165Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3264558Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3264761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3264869Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3264963Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3265059Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3265062Z 2025-12-04T13:44:26.3265296Z [rank3]:[W1204 13:41:21.993527485 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3265466Z [rank1]:[W1204 13:41:21.655105467 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3265639Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3265896Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3266059Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3266426Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3266629Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3266735Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3266830Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3266925Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3266929Z 2025-12-04T13:44:26.3267175Z [rank1]:[W1204 13:41:21.656501506 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3267347Z [rank2]:[W1204 13:41:22.983882712 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3267576Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3267847Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3268021Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3268403Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3268605Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3268708Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3268805Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3268901Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3268903Z 2025-12-04T13:44:26.3269139Z [rank2]:[W1204 13:41:22.985135885 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3269310Z [rank3]:[W1204 13:41:22.993685977 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3269484Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3269739Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3269902Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3270272Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3270473Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3270577Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3270672Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3270768Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3270770Z 2025-12-04T13:44:26.3271002Z [rank3]:[W1204 13:41:22.995709963 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3271193Z [rank1]:[W1204 13:41:22.656660459 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3271370Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3271625Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3271808Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3272177Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3272388Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3272491Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3272586Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3272684Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3272687Z 2025-12-04T13:44:26.3272918Z [rank1]:[W1204 13:41:22.658506429 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3273090Z [rank2]:[W1204 13:41:23.985286258 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3273263Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3273520Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3273684Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3274051Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3274255Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3274358Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3274454Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3274553Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3274556Z 2025-12-04T13:44:26.3274789Z [rank2]:[W1204 13:41:23.986522960 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3274961Z [rank3]:[W1204 13:41:23.995817557 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3275148Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3275404Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3275581Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3275960Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3276171Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3276274Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3276369Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3276465Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3276468Z 2025-12-04T13:44:26.3276701Z [rank3]:[W1204 13:41:23.997010121 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3276871Z [rank1]:[W1204 13:41:23.658686521 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3277049Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3277303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3277466Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3277884Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3278090Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3278192Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3278289Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3278384Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3278387Z 2025-12-04T13:44:26.3278620Z [rank1]:[W1204 13:41:23.660514001 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3278791Z [rank2]:[W1204 13:41:24.986665124 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3278965Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3279235Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3279397Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3279800Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3280005Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3280128Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3280223Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3280318Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3280320Z 2025-12-04T13:44:26.3280554Z [rank2]:[W1204 13:41:24.987913436 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3280723Z [rank3]:[W1204 13:41:24.997134714 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3280899Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3281155Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3281316Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3281684Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3281886Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3281993Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3282088Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3282184Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3282186Z 2025-12-04T13:44:26.3282419Z [rank3]:[W1204 13:41:24.999502772 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3282592Z [rank1]:[W1204 13:41:24.660669224 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3282767Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3283032Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3283196Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3283572Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3283785Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3283899Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3283994Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3284091Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3284093Z 2025-12-04T13:44:26.3284328Z [rank1]:[W1204 13:41:24.662434925 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3284502Z [rank2]:[W1204 13:41:25.988053560 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3284675Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3284932Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3285094Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3285462Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3285666Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3285770Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3285866Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3285962Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3285964Z 2025-12-04T13:44:26.3286197Z [rank2]:[W1204 13:41:25.989967048 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3286368Z [rank3]:[W1204 13:41:25.999628226 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3286546Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3286803Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3286976Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3287344Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3287607Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3287711Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3287819Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3287916Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3287918Z 2025-12-04T13:44:26.3288150Z [rank3]:[W1204 13:41:25.001412137 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3288320Z [rank1]:[W1204 13:41:25.662585618 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3288496Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3288754Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3288918Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3289284Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3289488Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3289592Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3289688Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3289786Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3289788Z 2025-12-04T13:44:26.3290021Z [rank1]:[W1204 13:41:25.663826671 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3290193Z [rank2]:[W1204 13:41:26.990113681 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3290368Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3290625Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3290788Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3291176Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3291390Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3291505Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3291601Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3291707Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3291709Z 2025-12-04T13:44:26.3291942Z [rank2]:[W1204 13:41:26.991980500 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3292111Z [rank3]:[W1204 13:41:26.001530351 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3292285Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3292540Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3292703Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3293077Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3293280Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3293385Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3293479Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3293574Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3293579Z 2025-12-04T13:44:26.3293812Z [rank3]:[W1204 13:41:26.003796021 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3293982Z [rank1]:[W1204 13:41:26.663952385 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3294155Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3294411Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3294572Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3294950Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3295151Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3295281Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3295378Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3295475Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3295477Z 2025-12-04T13:44:26.3295719Z [rank1]:[W1204 13:41:26.665551520 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3295888Z [rank2]:[W1204 13:41:27.992139223 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3296061Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3296315Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3296479Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3296848Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3297049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3297152Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3297249Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3297345Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3297347Z 2025-12-04T13:44:26.3297621Z [rank2]:[W1204 13:41:27.994055241 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3297794Z [rank3]:[W1204 13:41:27.003947544 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3297970Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3298230Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3298393Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3298774Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3298976Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3299081Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3299188Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3299298Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3299300Z 2025-12-04T13:44:26.3299532Z [rank3]:[W1204 13:41:27.005994000 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3299717Z [rank1]:[W1204 13:41:27.665688054 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3299891Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3300145Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3300308Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3300674Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3300876Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3300979Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3301073Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3301172Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3301173Z 2025-12-04T13:44:26.3301406Z [rank1]:[W1204 13:41:27.666920906 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3301577Z [rank2]:[W1204 13:41:28.994186895 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3301750Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3302009Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3302173Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3302542Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3302757Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3302860Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3302956Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3303052Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3303064Z 2025-12-04T13:44:26.3303308Z [rank2]:[W1204 13:41:28.995803660 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3303478Z [rank3]:[W1204 13:41:28.006396017 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3303667Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3303923Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3304086Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3304457Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3304660Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3304765Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3304860Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3304956Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3304959Z 2025-12-04T13:44:26.3305192Z [rank3]:[W1204 13:41:28.008764306 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3305363Z [rank1]:[W1204 13:41:28.667099779 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3305540Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3305795Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3305959Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3306335Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3306547Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3306652Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3306746Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3306843Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3306845Z 2025-12-04T13:44:26.3307100Z [rank1]:[W1204 13:41:28.668877080 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3307271Z [rank2]:[W1204 13:41:29.995946043 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3307455Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3307749Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3307911Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3308281Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3308486Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3308592Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3308688Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3308784Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3308788Z 2025-12-04T13:44:26.3309020Z [rank2]:[W1204 13:41:29.997333413 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3309191Z [rank3]:[W1204 13:41:29.008897869 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3309366Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3309623Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3309784Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3310151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3310353Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3310458Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3310575Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3310673Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3310674Z 2025-12-04T13:44:26.3310907Z [rank3]:[W1204 13:41:29.010728769 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3311103Z [rank1]:[W1204 13:41:29.669027164 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3311278Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3311551Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3311714Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3312079Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3312281Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3312386Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3312481Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3312578Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3312580Z 2025-12-04T13:44:26.3312816Z [rank1]:[W1204 13:41:29.670815475 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3312990Z [rank2]:[W1204 13:41:30.997794300 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3313164Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3313419Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3313581Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3313951Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3314152Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3314255Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3314351Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3314460Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3314462Z 2025-12-04T13:44:26.3314695Z [rank2]:[W1204 13:41:30.999240198 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3314875Z [rank3]:[W1204 13:41:30.010894222 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3315062Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3315319Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3315493Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3315861Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3316066Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3316170Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3316265Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3316364Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3316366Z 2025-12-04T13:44:26.3316601Z [rank3]:[W1204 13:41:30.012811480 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3316772Z [rank1]:[W1204 13:41:30.670970078 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3316948Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3317205Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3317370Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3317780Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3317984Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3318088Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3318182Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3318279Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3318281Z 2025-12-04T13:44:26.3318526Z [rank1]:[W1204 13:41:30.672800848 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3318698Z [rank2]:[W1204 13:41:31.999401281 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3318898Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3319153Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3319328Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3319699Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3319903Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3320006Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3320101Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3320198Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3320200Z 2025-12-04T13:44:26.3320432Z [rank2]:[W1204 13:41:31.000670763 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3320601Z [rank3]:[W1204 13:41:31.012964684 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3320775Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3321032Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3321194Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3321566Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3321769Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3321875Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3321970Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3322065Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3322068Z 2025-12-04T13:44:26.3322311Z [rank3]:[W1204 13:41:31.014683226 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3322482Z [rank1]:[W1204 13:41:31.672942052 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3322656Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3322930Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3323093Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3323476Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3323677Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3323782Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3323877Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3323973Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3323976Z 2025-12-04T13:44:26.3324210Z [rank1]:[W1204 13:41:31.674884379 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3324380Z [rank2]:[W1204 13:41:32.000826947 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3324552Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3324807Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3324972Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3325342Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3325548Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3325650Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3325748Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3325844Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3325846Z 2025-12-04T13:44:26.3326081Z [rank2]:[W1204 13:41:32.002291615 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3326262Z [rank3]:[W1204 13:41:32.014811121 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3326437Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3326706Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3326880Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3327247Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3327458Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3327583Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3327678Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3327778Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3327780Z 2025-12-04T13:44:26.3328013Z [rank3]:[W1204 13:41:32.016809167 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3328189Z [rank1]:[W1204 13:41:32.675052763 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3328365Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3328619Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3328784Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3329153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3329357Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3329460Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3329555Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3329654Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3329656Z 2025-12-04T13:44:26.3329888Z [rank1]:[W1204 13:41:32.676319905 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3330062Z [rank2]:[W1204 13:41:33.002444849 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3330250Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3330510Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3330699Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3331066Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3331284Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3331388Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3331484Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3331579Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3331582Z 2025-12-04T13:44:26.3331817Z [rank2]:[W1204 13:41:33.003861038 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3331988Z [rank3]:[W1204 13:41:33.016974700 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3332164Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3332421Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3332583Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3332952Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3333155Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3333258Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3333352Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3333448Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3333451Z 2025-12-04T13:44:26.3333684Z [rank3]:[W1204 13:41:33.019075684 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3333853Z [rank1]:[W1204 13:41:33.676456709 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3334038Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3334293Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3334456Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3334847Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3335060Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3335164Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3335260Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3335355Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3335357Z 2025-12-04T13:44:26.3335590Z [rank1]:[W1204 13:41:33.677695962 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3335761Z [rank2]:[W1204 13:41:34.004040841 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3335935Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3336190Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3336354Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3336720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3336924Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3337029Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3337126Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3337221Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3337223Z 2025-12-04T13:44:26.3337458Z [rank2]:[W1204 13:41:34.006273412 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3337647Z [rank3]:[W1204 13:41:34.019216168 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3337822Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3338094Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3338255Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3338639Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3338858Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3338976Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3339073Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3339170Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3339171Z 2025-12-04T13:44:26.3339406Z [rank3]:[W1204 13:41:34.021066048 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3339578Z [rank1]:[W1204 13:41:34.677861386 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3339753Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3340008Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3340172Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3340539Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3340742Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3340850Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3340945Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3341042Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3341044Z 2025-12-04T13:44:26.3341283Z [rank1]:[W1204 13:41:34.679120488 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3341455Z [rank2]:[W1204 13:41:35.006420986 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3341628Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3341891Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3342055Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3342432Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3342644Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3342748Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3342859Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3342956Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3342958Z 2025-12-04T13:44:26.3343191Z [rank2]:[W1204 13:41:35.007861054 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3343364Z [rank3]:[W1204 13:41:35.021218522 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3343540Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3343797Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3343960Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3344327Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3344529Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3344633Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3344729Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3344824Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3344826Z 2025-12-04T13:44:26.3345058Z [rank3]:[W1204 13:41:35.023337325 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3345228Z [rank1]:[W1204 13:41:35.679266532 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3345404Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3345659Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3345835Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3346205Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3346427Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3346532Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3346627Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3346734Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3346735Z 2025-12-04T13:44:26.3346968Z [rank1]:[W1204 13:41:35.680766739 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3347138Z [rank2]:[W1204 13:41:36.008025058 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3347314Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3347605Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3347772Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3348142Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3348345Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3348450Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3348547Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3348645Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3348648Z 2025-12-04T13:44:26.3348882Z [rank2]:[W1204 13:41:36.009379388 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3349052Z [rank3]:[W1204 13:41:36.023485619 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3349226Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3349485Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3349647Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3350036Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3350239Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3350371Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3350467Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3350563Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3350578Z 2025-12-04T13:44:26.3350811Z [rank3]:[W1204 13:41:36.024833810 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3350979Z [rank1]:[W1204 13:41:36.680901064 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3351153Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3351408Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3351569Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3351938Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3352138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3352243Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3352340Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3352437Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3352439Z 2025-12-04T13:44:26.3352672Z [rank1]:[W1204 13:41:36.682853591 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3352844Z [rank2]:[W1204 13:41:37.009529763 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3353017Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3353273Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3353438Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3353814Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3354017Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3354120Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3354245Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3354343Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3354344Z 2025-12-04T13:44:26.3354584Z [rank2]:[W1204 13:41:37.011186136 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3354767Z [rank3]:[W1204 13:41:37.024972814 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3354940Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3355197Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3355359Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3355726Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3355927Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3356031Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3356129Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3356225Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3356227Z 2025-12-04T13:44:26.3356462Z [rank3]:[W1204 13:41:37.027345942 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3356634Z [rank1]:[W1204 13:41:37.682995435 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3356809Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3357063Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3357229Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3357631Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3357845Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3357950Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3358044Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3358169Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3358171Z 2025-12-04T13:44:26.3358403Z [rank1]:[W1204 13:41:37.684239548 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3358588Z [rank2]:[W1204 13:41:38.011338081 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3358763Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3359022Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3359186Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3359551Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3359756Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3359859Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3359954Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3360050Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3360055Z 2025-12-04T13:44:26.3360289Z [rank2]:[W1204 13:41:38.012994354 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3360460Z [rank3]:[W1204 13:41:38.027477247 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3360635Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3360891Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3361053Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3361422Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3361635Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3361738Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3361834Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3361929Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3361943Z 2025-12-04T13:44:26.3362186Z [rank3]:[W1204 13:41:38.029180510 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3362356Z [rank1]:[W1204 13:41:38.684371713 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3362546Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3362800Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3362962Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3363333Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3363536Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3363641Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3363735Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3363831Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3363832Z 2025-12-04T13:44:26.3364066Z [rank1]:[W1204 13:41:38.685582756 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3364238Z [rank2]:[W1204 13:41:39.013818704 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3364414Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3364669Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3364832Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3365199Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3365404Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3365521Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3365617Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3365713Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3365716Z 2025-12-04T13:44:26.3365959Z [rank2]:[W1204 13:41:39.016445756 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3366140Z [rank3]:[W1204 13:41:39.029317374 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3366313Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3366582Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3366744Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3367113Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3367315Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3367419Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3367551Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3367648Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3367650Z 2025-12-04T13:44:26.3367887Z [rank3]:[W1204 13:41:39.030485479 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3368059Z [rank1]:[W1204 13:41:39.685720181 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3368233Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3368489Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3368652Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3369020Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3369221Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3369326Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3369439Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3369536Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3369539Z 2025-12-04T13:44:26.3369772Z [rank1]:[W1204 13:41:39.686950634 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3369970Z [rank2]:[W1204 13:41:40.016620830 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3370146Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3370399Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3370581Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3370949Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3371154Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3371258Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3371354Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3371451Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3371454Z 2025-12-04T13:44:26.3371687Z [rank2]:[W1204 13:41:40.018806332 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3371858Z [rank3]:[W1204 13:41:40.030679882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3372035Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3372290Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3372456Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3372822Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3373024Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3373128Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3373223Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3373319Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3373331Z 2025-12-04T13:44:26.3373564Z [rank3]:[W1204 13:41:40.032630669 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3373733Z [rank1]:[W1204 13:41:40.687425251 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3373933Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3374191Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3374369Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3374738Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3374939Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3375045Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3375139Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3375238Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3375240Z 2025-12-04T13:44:26.3375473Z [rank1]:[W1204 13:41:40.688880449 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3375644Z [rank2]:[W1204 13:41:41.018969146 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3375820Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3376074Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3376240Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3376610Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3376812Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3376917Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3377014Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3377109Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3377114Z 2025-12-04T13:44:26.3377355Z [rank2]:[W1204 13:41:41.020257128 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3377565Z [rank3]:[W1204 13:41:41.032765874 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3377739Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3378022Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3378183Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3378575Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3378778Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3378883Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3378978Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3379073Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3379077Z 2025-12-04T13:44:26.3379310Z [rank3]:[W1204 13:41:41.033937518 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3379480Z [rank1]:[W1204 13:41:41.689064783 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3379657Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3379912Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3380076Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3380445Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3380647Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3380751Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3380849Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3380946Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3380948Z 2025-12-04T13:44:26.3381180Z [rank1]:[W1204 13:41:41.690305916 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3381365Z [rank2]:[W1204 13:41:42.020414373 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3381541Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3381804Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3381978Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3382346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3382567Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3382670Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3382767Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3382864Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3382866Z 2025-12-04T13:44:26.3383098Z [rank2]:[W1204 13:41:42.022256082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3383270Z [rank3]:[W1204 13:41:42.034115952 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3383447Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3383702Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3383866Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3384234Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3384436Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3384539Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3384634Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3384731Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3384733Z 2025-12-04T13:44:26.3384966Z [rank3]:[W1204 13:41:42.035383345 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3385149Z [rank1]:[W1204 13:41:42.690472840 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3385325Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3385578Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3385764Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3386133Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3386348Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3386454Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3386549Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3386647Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3386649Z 2025-12-04T13:44:26.3386880Z [rank1]:[W1204 13:41:42.691732843 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3387053Z [rank2]:[W1204 13:41:43.022435606 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3387231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3387529Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3387695Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3388063Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3388267Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3388370Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3388466Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3388564Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3388566Z 2025-12-04T13:44:26.3388799Z [rank2]:[W1204 13:41:43.023991882 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3388969Z [rank3]:[W1204 13:41:43.035767784 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3389159Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3389417Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3389580Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3389973Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3390190Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3390293Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3390388Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3390483Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3390486Z 2025-12-04T13:44:26.3390719Z [rank3]:[W1204 13:41:43.037011237 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3390888Z [rank1]:[W1204 13:41:43.691880057 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3391063Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3391319Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3391481Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3391853Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3392054Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3392158Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3392253Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3392348Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3392350Z 2025-12-04T13:44:26.3392582Z [rank1]:[W1204 13:41:43.693122530 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3392753Z [rank2]:[W1204 13:41:44.024144367 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3392929Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3393195Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3393359Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3393736Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3393950Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3394065Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3394161Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3394259Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3394261Z 2025-12-04T13:44:26.3394494Z [rank2]:[W1204 13:41:44.025547886 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3394666Z [rank3]:[W1204 13:41:44.037119213 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3394839Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3395097Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3395259Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3395629Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3395832Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3395936Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3396033Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3396130Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3396132Z 2025-12-04T13:44:26.3396367Z [rank3]:[W1204 13:41:44.039170538 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3396539Z [rank1]:[W1204 13:41:44.693295754 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3396714Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3396981Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3397143Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3397570Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3397788Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3397893Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3398001Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3398097Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3398098Z 2025-12-04T13:44:26.3402451Z [rank1]:[W1204 13:41:44.694833071 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3402635Z [rank2]:[W1204 13:41:45.025723770 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3402812Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3403070Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3403238Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3403609Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3403813Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3403917Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3404014Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3404112Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3404115Z 2025-12-04T13:44:26.3404349Z [rank2]:[W1204 13:41:45.028023870 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3404521Z [rank3]:[W1204 13:41:45.039305883 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3404699Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3404955Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3405145Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3405515Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3405740Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3405846Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3405941Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3406048Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3406051Z 2025-12-04T13:44:26.3406285Z [rank3]:[W1204 13:41:45.040927097 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3406455Z [rank1]:[W1204 13:41:45.694992525 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3406633Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3406891Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3407055Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3407421Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3407661Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3407766Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3407860Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3407958Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3407960Z 2025-12-04T13:44:26.3408192Z [rank1]:[W1204 13:41:45.696274967 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3408363Z [rank2]:[W1204 13:41:46.028182015 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3408539Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3408795Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3408960Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3409347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3409549Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3409680Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3409775Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3409871Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3409886Z 2025-12-04T13:44:26.3410119Z [rank2]:[W1204 13:41:46.030202830 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3410289Z [rank3]:[W1204 13:41:46.041051243 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3410462Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3410719Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3410881Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3411251Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3411455Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3411560Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3411655Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3411749Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3411752Z 2025-12-04T13:44:26.3411986Z [rank3]:[W1204 13:41:46.043478989 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3412156Z [rank1]:[W1204 13:41:46.696439652 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3412330Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3412585Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3412746Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3413122Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3413323Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3413428Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3413544Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3413642Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3413644Z 2025-12-04T13:44:26.3413877Z [rank1]:[W1204 13:41:46.697913470 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3414060Z [rank2]:[W1204 13:41:47.030401574 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3414234Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3414489Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3414654Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3415019Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3415223Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3415327Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3415423Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3415520Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3415522Z 2025-12-04T13:44:26.3415756Z [rank2]:[W1204 13:41:47.032466319 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3415927Z [rank3]:[W1204 13:41:47.043617615 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3416102Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3416358Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3416521Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3416897Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3417100Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3417203Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3417298Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3417419Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3417422Z 2025-12-04T13:44:26.3417697Z [rank3]:[W1204 13:41:47.045043184 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3417888Z [rank1]:[W1204 13:41:47.698057995 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3418064Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3418318Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3418481Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3418848Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3419049Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3419154Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3419248Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3419346Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3419350Z 2025-12-04T13:44:26.3419584Z [rank1]:[W1204 13:41:47.699571982 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3419756Z [rank2]:[W1204 13:41:48.032638554 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3419933Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3420191Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3420356Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3420727Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3420942Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3421046Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3421141Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3421237Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3421252Z 2025-12-04T13:44:26.3421496Z [rank2]:[W1204 13:41:48.034689419 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3421668Z [rank3]:[W1204 13:41:48.045167739 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3421854Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3422115Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3422279Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3422647Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3422851Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3422953Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3423049Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3423143Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3423145Z 2025-12-04T13:44:26.3423380Z [rank3]:[W1204 13:41:48.046917601 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3423548Z [rank1]:[W1204 13:41:48.699696607 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3423725Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3423980Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3424141Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3424512Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3424712Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3424826Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3424921Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3425017Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3425019Z 2025-12-04T13:44:26.3425262Z [rank1]:[W1204 13:41:48.700900071 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3425444Z [rank2]:[W1204 13:41:49.034874193 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3425620Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3425885Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3426047Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3426415Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3426623Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3426729Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3426823Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3426919Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3426921Z 2025-12-04T13:44:26.3427152Z [rank2]:[W1204 13:41:49.037272331 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3427325Z [rank3]:[W1204 13:41:49.047088366 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3427536Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3427797Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3427958Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3428326Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3428528Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3428632Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3428744Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3428839Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3428841Z 2025-12-04T13:44:26.3429075Z [rank3]:[W1204 13:41:49.048852077 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3429271Z [rank1]:[W1204 13:41:49.701098735 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3429447Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3429717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3429878Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3430244Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3430445Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3430549Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3430645Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3430742Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3430744Z 2025-12-04T13:44:26.3430981Z [rank1]:[W1204 13:41:49.702657921 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3431152Z [rank2]:[W1204 13:41:50.037435956 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3431327Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3431582Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3431748Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3432116Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3432319Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3432424Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3432519Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3432643Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3432645Z 2025-12-04T13:44:26.3432877Z [rank2]:[W1204 13:41:50.039230896 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3433051Z [rank3]:[W1204 13:41:50.049006422 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3433251Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3433507Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3433682Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3434049Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3434251Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3434354Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3434450Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3434545Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3434549Z 2025-12-04T13:44:26.3434781Z [rank3]:[W1204 13:41:50.050898091 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3434949Z [rank1]:[W1204 13:41:50.702817776 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3435127Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3435385Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3435549Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3435916Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3436116Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3436222Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3436316Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3436413Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3436415Z 2025-12-04T13:44:26.3436659Z [rank1]:[W1204 13:41:50.704081438 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3436829Z [rank2]:[W1204 13:41:51.039382022 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3437004Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3437278Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3437455Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3437871Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3438072Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3438178Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3438273Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3438369Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3438372Z 2025-12-04T13:44:26.3438605Z [rank2]:[W1204 13:41:51.040611985 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3438774Z [rank3]:[W1204 13:41:51.051042066 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3438947Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3439204Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3439367Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3439735Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3439937Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3440041Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3440137Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3440232Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3440235Z 2025-12-04T13:44:26.3440484Z [rank3]:[W1204 13:41:51.052601282 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3440653Z [rank1]:[W1204 13:41:51.704253703 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3440827Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3441096Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3441271Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3441655Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3441857Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3441962Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3442059Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3442155Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3442157Z 2025-12-04T13:44:26.3442392Z [rank1]:[W1204 13:41:51.706399076 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3442563Z [rank2]:[W1204 13:41:52.040797579 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3442737Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3442992Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3443156Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3443521Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3443724Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3443827Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3443922Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3444020Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3444022Z 2025-12-04T13:44:26.3444255Z [rank2]:[W1204 13:41:52.043035350 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3444435Z [rank3]:[W1204 13:41:52.052751277 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3444610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3444867Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3445049Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3445416Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3445632Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3445735Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3445831Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3445928Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3445930Z 2025-12-04T13:44:26.3446165Z [rank3]:[W1204 13:41:52.055055347 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3446337Z [rank1]:[W1204 13:41:52.706553512 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3446510Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3446764Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3446927Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3447294Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3447541Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3447645Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3447739Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3447835Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3447838Z 2025-12-04T13:44:26.3448071Z [rank1]:[W1204 13:41:52.708906010 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3448242Z [rank2]:[W1204 13:41:53.043196475 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3448433Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3448690Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3448866Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3449246Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3449464Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3449568Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3449662Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3449759Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3449762Z 2025-12-04T13:44:26.3449994Z [rank2]:[W1204 13:41:53.044876979 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3450164Z [rank3]:[W1204 13:41:53.055198862 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3450338Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3450596Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3450759Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3451125Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3451329Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3451432Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3451527Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3451621Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3451623Z 2025-12-04T13:44:26.3451859Z [rank3]:[W1204 13:41:53.057080351 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3452028Z [rank1]:[W1204 13:41:53.709056656 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3452203Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3452466Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3452628Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3453019Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3453219Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3453337Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3453432Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3453526Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3453528Z 2025-12-04T13:44:26.3453762Z [rank1]:[W1204 13:41:53.711465303 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3453932Z [rank2]:[W1204 13:41:54.045058664 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3454108Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3454363Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3454525Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3454894Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3455097Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3455203Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3455298Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3455394Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3455396Z 2025-12-04T13:44:26.3455628Z [rank2]:[W1204 13:41:54.046726377 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3455800Z [rank3]:[W1204 13:41:54.057225527 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3455973Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3456242Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3456405Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3456781Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3456995Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3457108Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3457204Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3457299Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3457301Z 2025-12-04T13:44:26.3457581Z [rank3]:[W1204 13:41:54.058935309 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3457754Z [rank1]:[W1204 13:41:54.711661157 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3457928Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3458183Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3458344Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3458711Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3458912Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3459016Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3459112Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3459207Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3459209Z 2025-12-04T13:44:26.3459448Z [rank1]:[W1204 13:41:54.714019826 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3459616Z [rank2]:[W1204 13:41:55.046886492 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3459793Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3460049Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3460225Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3460595Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3460834Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3460938Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3461048Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3461145Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3461147Z 2025-12-04T13:44:26.3461379Z [rank2]:[W1204 13:41:55.048803700 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3461551Z [rank3]:[W1204 13:41:55.059053986 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3461728Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3461985Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3462148Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3462515Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3462718Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3462822Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3462917Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3463014Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3463015Z 2025-12-04T13:44:26.3463249Z [rank3]:[W1204 13:41:55.060528123 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3463419Z [rank1]:[W1204 13:41:55.714176591 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3463596Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3463857Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3464019Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3464400Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3464611Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3464725Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3464820Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3464929Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3464932Z 2025-12-04T13:44:26.3465165Z [rank1]:[W1204 13:41:55.716566209 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3465334Z [rank2]:[W1204 13:41:56.048975296 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3465507Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3465762Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3465926Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3466298Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3466498Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3466605Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3466700Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3466797Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3466799Z 2025-12-04T13:44:26.3467032Z [rank2]:[W1204 13:41:56.050683278 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3467202Z [rank3]:[W1204 13:41:56.060694289 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3467375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3467671Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3467835Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3468219Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3468421Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3468550Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3468646Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3468741Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3468743Z 2025-12-04T13:44:26.3468992Z [rank3]:[W1204 13:41:56.063120306 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3469162Z [rank1]:[W1204 13:41:56.716739144 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3469336Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3469592Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3469754Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3470123Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3470325Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3470430Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3470529Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3470625Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3470627Z 2025-12-04T13:44:26.3470859Z [rank1]:[W1204 13:41:56.718779120 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3471029Z [rank2]:[W1204 13:41:57.050823944 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3471203Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3471457Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3471621Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3472000Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3472203Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3472307Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3472412Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3472518Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3472520Z 2025-12-04T13:44:26.3472753Z [rank2]:[W1204 13:41:57.052325551 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3472936Z [rank3]:[W1204 13:41:57.063279941 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3473110Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3473367Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3473532Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3473898Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3474101Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3474204Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3474299Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3474395Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3474398Z 2025-12-04T13:44:26.3474632Z [rank3]:[W1204 13:41:57.065019683 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3474807Z [rank1]:[W1204 13:41:57.718914996 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3474980Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3475234Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3475397Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3475763Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3475975Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3476078Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3476175Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3476270Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3476298Z 2025-12-04T13:44:26.3476532Z [rank1]:[W1204 13:41:57.721216235 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3476701Z [rank2]:[W1204 13:41:58.052464777 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3476888Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3477145Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3477307Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3477718Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3477923Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3478027Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3478122Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3478218Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3478221Z 2025-12-04T13:44:26.3478453Z [rank2]:[W1204 13:41:58.055046471 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3478623Z [rank3]:[W1204 13:41:58.065160059 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3478800Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3479056Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3479223Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3479592Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3479808Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3479912Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3480007Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3480102Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3480105Z 2025-12-04T13:44:26.3480362Z [rank3]:[W1204 13:41:58.067341131 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3480534Z [rank1]:[W1204 13:41:58.721411470 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3480720Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3480976Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3481137Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3481507Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3481710Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3481815Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3481909Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3482004Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3482005Z 2025-12-04T13:44:26.3482249Z [rank1]:[W1204 13:41:58.723818127 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3482419Z [rank2]:[W1204 13:41:59.055201817 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3482595Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3482850Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3483015Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3483386Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3483592Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3483697Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3483801Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3483899Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3483901Z 2025-12-04T13:44:26.3484134Z [rank2]:[W1204 13:41:59.056922519 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3484324Z [rank3]:[W1204 13:41:59.067501207 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3484499Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3484766Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3484928Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3485295Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3485497Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3485601Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3485698Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3485795Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3485798Z 2025-12-04T13:44:26.3486029Z [rank3]:[W1204 13:41:59.069877615 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3486200Z [rank1]:[W1204 13:41:59.723970314 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3486373Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3486627Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3486789Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3487157Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3487360Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3487463Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3487608Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3487718Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3487720Z 2025-12-04T13:44:26.3487956Z [rank1]:[W1204 13:41:59.726105347 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3488137Z [rank2]:[W1204 13:42:00.057063185 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3488325Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3488582Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3488761Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3489130Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3489332Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3489436Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3489532Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3489629Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3489631Z 2025-12-04T13:44:26.3489863Z [rank2]:[W1204 13:42:00.058335107 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3490037Z [rank3]:[W1204 13:42:00.070016671 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3490215Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3490469Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3490633Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3491002Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3491207Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3491309Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3491406Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3491503Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3491505Z 2025-12-04T13:44:26.3491756Z [rank3]:[W1204 13:42:00.072457028 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3491928Z [rank1]:[W1204 13:42:00.726248293 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3492123Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3492380Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3492551Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3492921Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3493125Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3493229Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3493325Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3493420Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3493422Z 2025-12-04T13:44:26.3493657Z [rank1]:[W1204 13:42:00.728577412 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3493827Z [rank2]:[W1204 13:42:01.058448634 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3494008Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3494263Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3494425Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3494795Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3494996Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3495102Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3495197Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3495292Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3495295Z 2025-12-04T13:44:26.3495539Z [rank2]:[W1204 13:42:01.059740536 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3495711Z [rank3]:[W1204 13:42:01.072572295 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3495886Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3496163Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3496326Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3496704Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3496906Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3497012Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3497108Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3497204Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3497206Z 2025-12-04T13:44:26.3497447Z [rank3]:[W1204 13:42:01.074321696 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3497655Z [rank1]:[W1204 13:42:01.728731518 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3497829Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3498089Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3498251Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3498621Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3498824Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3498927Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3499024Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3499120Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3499122Z 2025-12-04T13:44:26.3499358Z [rank1]:[W1204 13:42:01.730709504 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3499543Z [rank2]:[W1204 13:42:02.059864912 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3499718Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3499987Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3500164Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3500536Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3500751Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3500855Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3500952Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3501049Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3501051Z 2025-12-04T13:44:26.3501284Z [rank2]:[W1204 13:42:02.061565345 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3501456Z [rank3]:[W1204 13:42:02.074478312 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3501630Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3501885Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3502048Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3502418Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3502624Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3502727Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3502822Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3502919Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3502921Z 2025-12-04T13:44:26.3503154Z [rank3]:[W1204 13:42:02.076657825 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3503338Z [rank1]:[W1204 13:42:02.730901500 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3503513Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3503768Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3503953Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3504321Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3504539Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3504643Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3504739Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3504834Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3504837Z 2025-12-04T13:44:26.3505070Z [rank1]:[W1204 13:42:02.732912646 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3505238Z [rank2]:[W1204 13:42:03.061677012 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3505414Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3505668Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3505831Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3506199Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3506406Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3506510Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3506606Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3506703Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3506706Z 2025-12-04T13:44:26.3506940Z [rank2]:[W1204 13:42:03.063452083 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3507111Z [rank3]:[W1204 13:42:03.076822401 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3507296Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3507598Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3507760Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3508157Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3508372Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3508477Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3508571Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3508668Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3508670Z 2025-12-04T13:44:26.3508902Z [rank3]:[W1204 13:42:03.078899145 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3509074Z [rank1]:[W1204 13:42:03.733111791 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3509251Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3509507Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3509668Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3510034Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3510235Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3510339Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3510434Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3510528Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3510530Z 2025-12-04T13:44:26.3510763Z [rank1]:[W1204 13:42:03.734543230 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3510932Z [rank2]:[W1204 13:42:04.063588520 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3511107Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3511381Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3511544Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3511922Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3512133Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3512247Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3512342Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3512440Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3512441Z 2025-12-04T13:44:26.3512674Z [rank2]:[W1204 13:42:04.065700814 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3512844Z [rank3]:[W1204 13:42:04.079061611 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3513019Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3513278Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3513441Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3513807Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3514009Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3514113Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3514208Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3514304Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3514306Z 2025-12-04T13:44:26.3514537Z [rank3]:[W1204 13:42:04.081270703 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3514709Z [rank1]:[W1204 13:42:04.734684836 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3514882Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3515149Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3515311Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3515693Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3515904Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3516007Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3516113Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3516209Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3516211Z 2025-12-04T13:44:26.3516444Z [rank1]:[W1204 13:42:04.737577233 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3516614Z [rank2]:[W1204 13:42:05.065818541 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3516789Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3517043Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3517207Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3517611Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3517817Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3517921Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3518017Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3518115Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3518117Z 2025-12-04T13:44:26.3518351Z [rank2]:[W1204 13:42:05.067628151 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3518520Z [rank3]:[W1204 13:42:05.081367400 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3518697Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3518951Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3519129Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3519495Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3519733Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3519840Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3519935Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3520044Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3520046Z 2025-12-04T13:44:26.3520283Z [rank3]:[W1204 13:42:05.083889285 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3520454Z [rank1]:[W1204 13:42:05.737720919 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3520630Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3520885Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3521047Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3521416Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3521617Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3521722Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3521817Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3521912Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3521914Z 2025-12-04T13:44:26.3522153Z [rank1]:[W1204 13:42:05.739913221 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3522322Z [rank2]:[W1204 13:42:06.067789588 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3522499Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3522756Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3522918Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3523298Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3523501Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3523627Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3523722Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3523818Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3523831Z 2025-12-04T13:44:26.3524064Z [rank2]:[W1204 13:42:06.069030210 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3524235Z [rank3]:[W1204 13:42:06.084005882 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3524413Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3524670Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3524833Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3525204Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3525407Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3525515Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3525610Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3525705Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3525707Z 2025-12-04T13:44:26.3525940Z [rank3]:[W1204 13:42:06.086386110 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3526111Z [rank1]:[W1204 13:42:06.740101277 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3526284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3526542Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3526704Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3527081Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3527284Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3527388Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3527551Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3527647Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3527649Z 2025-12-04T13:44:26.3527881Z [rank1]:[W1204 13:42:06.742352798 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3528068Z [rank2]:[W1204 13:42:07.069725655 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3528245Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3528503Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3528667Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3529043Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3529246Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3529350Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3529445Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3529544Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3529546Z 2025-12-04T13:44:26.3529779Z [rank2]:[W1204 13:42:07.071744421 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3529953Z [rank3]:[W1204 13:42:07.086544097 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3530129Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3530384Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3530548Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3530921Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3531138Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3531243Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3531338Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3531466Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3531468Z 2025-12-04T13:44:26.3531701Z [rank3]:[W1204 13:42:07.088830907 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3531884Z [rank1]:[W1204 13:42:07.742533514 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3532058Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3532314Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3532479Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3532844Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3533050Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3533154Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3533249Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3533345Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3533348Z 2025-12-04T13:44:26.3533585Z [rank1]:[W1204 13:42:07.744823384 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3533754Z [rank2]:[W1204 13:42:08.071938187 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3533930Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3534185Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3534348Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3534718Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3534930Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3535034Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3535129Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3535226Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3535245Z 2025-12-04T13:44:26.3535494Z [rank2]:[W1204 13:42:08.073687158 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3535665Z [rank3]:[W1204 13:42:08.088991393 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3535852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3536106Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3536269Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3536636Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3536839Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3536943Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3537037Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3537134Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3537136Z 2025-12-04T13:44:26.3537370Z [rank3]:[W1204 13:42:08.091013119 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3537578Z [rank1]:[W1204 13:42:08.745313213 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3537753Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3538009Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3538172Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3538540Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3538742Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3538861Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3538957Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3539051Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3539053Z 2025-12-04T13:44:26.3539300Z [rank1]:[W1204 13:42:08.747951045 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3539483Z [rank2]:[W1204 13:42:09.073782286 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3539657Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3539929Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3540091Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3540457Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3540658Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3540764Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3540859Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3540956Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3540958Z 2025-12-04T13:44:26.3541192Z [rank2]:[W1204 13:42:09.075831611 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3541363Z [rank3]:[W1204 13:42:09.091135666 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3541537Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3541794Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3541959Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3542327Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3542530Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3542636Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3542742Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3542840Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3542842Z 2025-12-04T13:44:26.3543074Z [rank3]:[W1204 13:42:09.093844477 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3543267Z [rank1]:[W1204 13:42:09.748130521 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3543442Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3543698Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3543874Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3544242Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3544446Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3544549Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3544646Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3544742Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3544744Z 2025-12-04T13:44:26.3544977Z [rank1]:[W1204 13:42:09.750490770 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3545147Z [rank2]:[W1204 13:42:10.076019807 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3545322Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3545578Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3545742Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3546109Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3546315Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3546420Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3546516Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3546623Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3546625Z 2025-12-04T13:44:26.3546858Z [rank2]:[W1204 13:42:10.077980494 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3547027Z [rank3]:[W1204 13:42:10.093979174 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3547223Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3547523Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3547700Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3548064Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3548268Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3548373Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3548468Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3548567Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3548569Z 2025-12-04T13:44:26.3548801Z [rank3]:[W1204 13:42:10.096022149 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3548971Z [rank1]:[W1204 13:42:10.750647156 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3549146Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3549401Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3549564Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3549930Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3550133Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3550239Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3550335Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3550430Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3550433Z 2025-12-04T13:44:26.3550691Z [rank1]:[W1204 13:42:10.751922768 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3550865Z [rank2]:[W1204 13:42:11.078162391 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3551040Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3551322Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3551485Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3551868Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3552069Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3552174Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3552271Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3552368Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3552371Z 2025-12-04T13:44:26.3552605Z [rank2]:[W1204 13:42:11.079778855 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3552775Z [rank3]:[W1204 13:42:11.096190356 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3552951Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3553208Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3553371Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3553742Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3553943Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3554047Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3554143Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3554238Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3554240Z 2025-12-04T13:44:26.3554472Z [rank3]:[W1204 13:42:11.098085314 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3554655Z [rank1]:[W1204 13:42:11.752078515 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3554829Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3555098Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3555271Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3555636Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3555848Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3555950Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3556048Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3556143Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3556145Z 2025-12-04T13:44:26.3556377Z [rank1]:[W1204 13:42:11.754535721 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3556549Z [rank2]:[W1204 13:42:12.079936632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3556721Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3556977Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3557142Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3557550Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3557752Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3557858Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3557953Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3558052Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3558053Z 2025-12-04T13:44:26.3558286Z [rank2]:[W1204 13:42:12.081598136 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3558470Z [rank3]:[W1204 13:42:12.098243961 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3558645Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3558900Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3559091Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3559464Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3559679Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3559782Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3559876Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3559973Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3559975Z 2025-12-04T13:44:26.3560207Z [rank3]:[W1204 13:42:12.100529401 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3560381Z [rank1]:[W1204 13:42:12.754667949 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3560555Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3560809Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3560973Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3561338Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3561543Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3561649Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3561745Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3561841Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3561844Z 2025-12-04T13:44:26.3562077Z [rank1]:[W1204 13:42:12.757039467 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3562248Z [rank2]:[W1204 13:42:13.081748403 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3562433Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3562688Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3562850Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3563241Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3563460Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3563564Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3563661Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3563757Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3563760Z 2025-12-04T13:44:26.3563996Z [rank2]:[W1204 13:42:13.082989865 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3564166Z [rank3]:[W1204 13:42:13.100677818 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3564343Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3564599Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3564761Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3565130Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3565332Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3565436Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3565531Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3565626Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3565628Z 2025-12-04T13:44:26.3565865Z [rank3]:[W1204 13:42:13.102851870 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3566039Z [rank1]:[W1204 13:42:13.757162615 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3566215Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3566481Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3566647Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3567030Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3567244Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3567360Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3567455Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3567579Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3567583Z 2025-12-04T13:44:26.3567815Z [rank1]:[W1204 13:42:13.759540713 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3567988Z [rank2]:[W1204 13:42:14.083136173 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3568161Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3568419Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3568579Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3568948Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3569152Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3569256Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3569353Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3569450Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3569452Z 2025-12-04T13:44:26.3569684Z [rank2]:[W1204 13:42:14.085273956 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3569854Z [rank3]:[W1204 13:42:14.102975518 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3570028Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3570298Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3570463Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3570844Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3571059Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3571178Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3571272Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3571368Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3571371Z 2025-12-04T13:44:26.3571605Z [rank3]:[W1204 13:42:14.104976974 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3571778Z [rank1]:[W1204 13:42:14.759696560 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3571953Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3572206Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3572372Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3572741Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3572945Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3573049Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3573145Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3573241Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3573245Z 2025-12-04T13:44:26.3573476Z [rank1]:[W1204 13:42:14.761911731 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3573646Z [rank2]:[W1204 13:42:15.085448893 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3573820Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3574075Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3574249Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3574616Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3574842Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3574945Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3575041Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3575147Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3575150Z 2025-12-04T13:44:26.3575383Z [rank2]:[W1204 13:42:15.087684204 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3575554Z [rank3]:[W1204 13:42:15.105116472 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3575730Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3575984Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3576148Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3576514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3576715Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3576820Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3576915Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3577012Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3577014Z 2025-12-04T13:44:26.3577246Z [rank3]:[W1204 13:42:15.107089678 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3577418Z [rank1]:[W1204 13:42:15.762102718 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3577630Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3577884Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3578047Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3578428Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3578630Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3578764Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3578860Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3578955Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3578972Z 2025-12-04T13:44:26.3579208Z [rank1]:[W1204 13:42:15.764064134 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3579378Z [rank2]:[W1204 13:42:16.087853971 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3579552Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3579808Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3579970Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3580338Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3580539Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3580644Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3580740Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3580835Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3580838Z 2025-12-04T13:44:26.3581072Z [rank2]:[W1204 13:42:16.089972364 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3581243Z [rank3]:[W1204 13:42:16.107248485 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3581418Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3581676Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3581839Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3582217Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3582419Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3582523Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3582640Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3582736Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3582738Z 2025-12-04T13:44:26.3582969Z [rank3]:[W1204 13:42:16.109258631 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3583154Z [rank1]:[W1204 13:42:16.764216812 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3583329Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3583586Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3583751Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3584122Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3584326Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3584429Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3584527Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3584623Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3584626Z 2025-12-04T13:44:26.3584859Z [rank1]:[W1204 13:42:16.765467934 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3585031Z [rank2]:[W1204 13:42:17.090105052 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3585204Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3585458Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3585622Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3586004Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3586207Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3586310Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3586405Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3586524Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3586527Z 2025-12-04T13:44:26.3586760Z [rank2]:[W1204 13:42:17.092560518 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3586942Z [rank3]:[W1204 13:42:17.109401559 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3587117Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3587371Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3587576Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3587946Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3588150Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3588257Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3588351Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3588450Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3588453Z 2025-12-04T13:44:26.3588686Z [rank3]:[W1204 13:42:17.111425825 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3588858Z [rank1]:[W1204 13:42:17.765637891 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3589034Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3589288Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3589451Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3589820Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3590039Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3590144Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3590240Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3590335Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3590351Z 2025-12-04T13:44:26.3590596Z [rank1]:[W1204 13:42:17.767939271 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3590769Z [rank2]:[W1204 13:42:18.092730755 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3590959Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3591214Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3591376Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3591747Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3591951Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3592055Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3592151Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3592247Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3592250Z 2025-12-04T13:44:26.3592486Z [rank2]:[W1204 13:42:18.094660623 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3592655Z [rank3]:[W1204 13:42:18.111549333 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3592831Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3593087Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3593250Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3593621Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3593823Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3593939Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3594034Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3594130Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3594132Z 2025-12-04T13:44:26.3594382Z [rank3]:[W1204 13:42:18.112704007 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3594566Z [rank1]:[W1204 13:42:18.768108878 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3594742Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3595008Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3595170Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3595536Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3595738Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3595842Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3595938Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3596034Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3596036Z 2025-12-04T13:44:26.3596268Z [rank1]:[W1204 13:42:18.769849510 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3596440Z [rank2]:[W1204 13:42:19.094852089 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3596615Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3596873Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3597034Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3597402Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3597642Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3597745Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3597855Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3597951Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3597953Z 2025-12-04T13:44:26.3598186Z [rank2]:[W1204 13:42:19.097015572 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3598387Z [rank3]:[W1204 13:42:19.112844675 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3598564Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3598837Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3599000Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3599367Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3599568Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3599671Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3599766Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3599863Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3599865Z 2025-12-04T13:44:26.3600098Z [rank3]:[W1204 13:42:19.114027249 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3600268Z [rank1]:[W1204 13:42:19.770006757 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3600447Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3600700Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3600864Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3601233Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3601437Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3601541Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3601636Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3601745Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3601747Z 2025-12-04T13:44:26.3601978Z [rank1]:[W1204 13:42:19.771266040 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3602148Z [rank2]:[W1204 13:42:20.097164640 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3602343Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3602600Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3602775Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3603142Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3603348Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3603454Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3603550Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3603646Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3603649Z 2025-12-04T13:44:26.3603882Z [rank2]:[W1204 13:42:20.098395863 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3604051Z [rank3]:[W1204 13:42:20.114156387 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3604228Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3604482Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3604646Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3605013Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3605215Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3605321Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3605417Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3605514Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3605516Z 2025-12-04T13:44:26.3605760Z [rank3]:[W1204 13:42:20.115347191 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3605933Z [rank1]:[W1204 13:42:20.771420787 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3606107Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3606383Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3606558Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3606923Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3607125Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3607231Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3607326Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3607425Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3607428Z 2025-12-04T13:44:26.3607707Z [rank1]:[W1204 13:42:20.772684630 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3607877Z [rank2]:[W1204 13:42:21.098619129 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3608050Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3608306Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3608469Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3608840Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3609042Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3609147Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3609243Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3609338Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3609341Z 2025-12-04T13:44:26.3609598Z [rank2]:[W1204 13:42:21.100450919 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3609768Z [rank3]:[W1204 13:42:21.115508749 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3609944Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3610216Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3610391Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3610775Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3610977Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3611082Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3611180Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3611279Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3611281Z 2025-12-04T13:44:26.3611514Z [rank3]:[W1204 13:42:21.117134613 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3611686Z [rank1]:[W1204 13:42:21.772874527 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3611861Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3612118Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3612282Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3612646Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3612848Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3612952Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3613049Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3613145Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3613147Z 2025-12-04T13:44:26.3613378Z [rank1]:[W1204 13:42:21.775296864 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3613561Z [rank2]:[W1204 13:42:22.100592487 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3613736Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3613990Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3614177Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3614548Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3614761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3614863Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3614959Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3615056Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3615058Z 2025-12-04T13:44:26.3615292Z [rank2]:[W1204 13:42:22.101829930 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3615462Z [rank3]:[W1204 13:42:22.117257062 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3615636Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3615892Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3616055Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3616424Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3616627Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3616730Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3616824Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3616921Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3616925Z 2025-12-04T13:44:26.3617157Z [rank3]:[W1204 13:42:22.118429846 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3617328Z [rank1]:[W1204 13:42:22.775444361 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3617545Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3617802Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3617977Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3618356Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3618573Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3618678Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3618773Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3618869Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3618872Z 2025-12-04T13:44:26.3619104Z [rank1]:[W1204 13:42:22.777844339 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3619274Z [rank2]:[W1204 13:42:23.101965778 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3619449Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3619705Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3619866Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3620234Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3620438Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3620540Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3620637Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3620733Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3620734Z 2025-12-04T13:44:26.3620970Z [rank2]:[W1204 13:42:23.103203681 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3621141Z [rank3]:[W1204 13:42:23.118541745 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3621317Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3621586Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3621747Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3622139Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3622339Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3622463Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3622558Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3622657Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3622659Z 2025-12-04T13:44:26.3622896Z [rank3]:[W1204 13:42:23.119722349 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3623067Z [rank1]:[W1204 13:42:23.777998587 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3623241Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3623497Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3623660Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3624028Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3624230Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3624335Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3624429Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3624526Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3624528Z 2025-12-04T13:44:26.3624760Z [rank1]:[W1204 13:42:23.780432713 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3624932Z [rank2]:[W1204 13:42:24.103334259 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3625106Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3625375Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3625539Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3625914Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3626128Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3626241Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3626339Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3626434Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3626436Z 2025-12-04T13:44:26.3626671Z [rank2]:[W1204 13:42:24.104544022 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3626840Z [rank3]:[W1204 13:42:24.119807428 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3627016Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3627274Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3627436Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3627840Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3628041Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3628144Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3628240Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3628336Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3628338Z 2025-12-04T13:44:26.3628571Z [rank3]:[W1204 13:42:24.121051591 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3628742Z [rank1]:[W1204 13:42:24.780589801 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3628917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3629170Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3629347Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3629720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3629948Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3630053Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3630165Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3630261Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3630263Z 2025-12-04T13:44:26.3630494Z [rank1]:[W1204 13:42:24.782914380 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3630665Z [rank2]:[W1204 13:42:25.104668291 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3630840Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3631096Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3631261Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3631630Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3631836Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3631939Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3632035Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3632131Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3632134Z 2025-12-04T13:44:26.3632367Z [rank2]:[W1204 13:42:25.105971762 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3632536Z [rank3]:[W1204 13:42:25.121176609 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3632710Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3632968Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3633132Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3633513Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3633725Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3633842Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3633937Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3634044Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3634046Z 2025-12-04T13:44:26.3634281Z [rank3]:[W1204 13:42:25.122333464 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3634449Z [rank1]:[W1204 13:42:25.783100227 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3634624Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3634878Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3635041Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3635408Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3635610Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3635717Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3635811Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3635907Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3635910Z 2025-12-04T13:44:26.3636144Z [rank1]:[W1204 13:42:25.785459216 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3636316Z [rank2]:[W1204 13:42:26.106103861 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3636489Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3636746Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3636909Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3637289Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3637531Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3637671Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3637769Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3637866Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3637882Z 2025-12-04T13:44:26.3638117Z [rank2]:[W1204 13:42:26.107721925 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3638289Z [rank3]:[W1204 13:42:26.122447533 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3638463Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3638720Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3638881Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3639248Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3639450Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3639554Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3639650Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3639746Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3639748Z 2025-12-04T13:44:26.3639984Z [rank3]:[W1204 13:42:26.123627807 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3640155Z [rank1]:[W1204 13:42:26.785620634 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3640331Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3640590Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3640755Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3641134Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3641337Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3641440Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3641545Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3641653Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3641656Z 2025-12-04T13:44:26.3641887Z [rank1]:[W1204 13:42:26.787818206 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3642069Z [rank2]:[W1204 13:42:27.107874994 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3642242Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3642499Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3642664Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3643035Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3643238Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3643342Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3643439Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3643535Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3643537Z 2025-12-04T13:44:26.3643774Z [rank2]:[W1204 13:42:27.109123696 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3643947Z [rank3]:[W1204 13:42:27.123760305 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3644120Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3644375Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3644538Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3644910Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3645125Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3645229Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3645325Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3645421Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3645444Z 2025-12-04T13:44:26.3645678Z [rank3]:[W1204 13:42:27.124932250 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3645846Z [rank1]:[W1204 13:42:27.787997623 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3646034Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3646288Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3646451Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3646818Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3647025Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3647132Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3647226Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3647321Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3647324Z 2025-12-04T13:44:26.3647591Z [rank1]:[W1204 13:42:27.789936201 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3647763Z [rank2]:[W1204 13:42:28.109282474 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3647940Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3648195Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3648357Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3648726Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3648943Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3649046Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3649142Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3649238Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3649240Z 2025-12-04T13:44:26.3649503Z [rank2]:[W1204 13:42:28.110507477 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3649675Z [rank3]:[W1204 13:42:28.125095168 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3653842Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3654105Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3654268Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3654640Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3654841Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3654949Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3655044Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3655141Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3655143Z 2025-12-04T13:44:26.3655378Z [rank3]:[W1204 13:42:28.126877319 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3655550Z [rank1]:[W1204 13:42:28.790086489 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3655726Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3655983Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3656147Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3656513Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3656716Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3656822Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3656929Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3657026Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3657028Z 2025-12-04T13:44:26.3657260Z [rank1]:[W1204 13:42:28.792431557 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3657463Z [rank2]:[W1204 13:42:29.110634866 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3657661Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3657940Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3658104Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3658471Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3658676Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3658781Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3658876Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3658972Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3658973Z 2025-12-04T13:44:26.3659207Z [rank2]:[W1204 13:42:29.111957837 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3659379Z [rank3]:[W1204 13:42:29.127035517 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3659552Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3659814Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3659978Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3660347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3660550Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3660656Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3660751Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3660860Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3660862Z 2025-12-04T13:44:26.3661097Z [rank3]:[W1204 13:42:29.128327309 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3661595Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T13:44:26.3661659Z current_size = base.storage().size() 2025-12-04T13:44:26.3662146Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T13:44:26.3662193Z current_size = base.storage().size() 2025-12-04T13:44:26.3662667Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T13:44:26.3662712Z current_size = base.storage().size() 2025-12-04T13:44:26.3662883Z [rank1]:[W1204 13:42:29.792598376 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3663059Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3663315Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3663477Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3663847Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3664051Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3664156Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3664251Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3664347Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3664351Z 2025-12-04T13:44:26.3664586Z [rank1]:[W1204 13:42:29.794842816 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3664758Z [rank2]:[W1204 13:42:30.112101786 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3664946Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3665201Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3665364Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3665752Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3665963Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3666068Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3666163Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3666261Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3666263Z 2025-12-04T13:44:26.3666497Z [rank2]:[W1204 13:42:30.113803258 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3666669Z [rank3]:[W1204 13:42:30.128442978 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3666846Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3667101Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3667265Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3667656Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3667860Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3667965Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3668061Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3668157Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3668158Z 2025-12-04T13:44:26.3668390Z [rank3]:[W1204 13:42:30.129642941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3668561Z [rank1]:[W1204 13:42:30.794993485 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3668735Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3669007Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3669169Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3669548Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3669765Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3669882Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3669977Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3670073Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3670075Z 2025-12-04T13:44:26.3670308Z [rank1]:[W1204 13:42:30.797443931 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
2025-12-04T13:44:26.3670352Z Autotune Choices Stats:
2025-12-04T13:44:26.3670827Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 0.020560000091791153, "best_triton_pos": 1, "best_triton_time": 0.10667700320482254, "best_triton_kernel": "triton_mm_42", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4"}
2025-12-04T13:44:26.3670876Z AUTOTUNE mm(1024x2048, 2048x1024)
2025-12-04T13:44:26.3670918Z strides: [s20, 1], [s79, 1]
2025-12-04T13:44:26.3670968Z dtypes: torch.bfloat16, torch.bfloat16
2025-12-04T13:44:26.3671006Z mm 0.0206 ms 100.0%
2025-12-04T13:44:26.3671247Z triton_mm_42 0.1067 ms 19.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2025-12-04T13:44:26.3671486Z triton_mm_55 0.1094 ms 18.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2025-12-04T13:44:26.3671717Z triton_mm_43 0.1130 ms 18.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2025-12-04T13:44:26.3671951Z triton_mm_54 0.1206 ms 17.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2025-12-04T13:44:26.3672180Z triton_mm_47 0.1239 ms 16.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2025-12-04T13:44:26.3672412Z triton_mm_44 0.1264 ms 16.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2025-12-04T13:44:26.3672661Z triton_mm_57 0.1560 ms 13.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2025-12-04T13:44:26.3672896Z triton_mm_56 0.1682 ms 12.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2025-12-04T13:44:26.3673128Z triton_mm_46 0.1707 ms 12.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2025-12-04T13:44:26.3673283Z SingleProcess AUTOTUNE benchmarking takes 1.2894 seconds and 0.5485 seconds precompiling for 37 choices
2025-12-04T13:44:26.3673456Z [rank2]:[W1204 13:42:31.113953967 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3673645Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3673903Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3674065Z frame #1: + 0x6eb6c0e
(0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3674434Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3674637Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3674745Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3674840Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3674938Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3674940Z 2025-12-04T13:44:26.3675179Z [rank2]:[W1204 13:42:31.115296057 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3675350Z [rank3]:[W1204 13:42:31.129759291 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3675524Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3675780Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3675943Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3676311Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 
(0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3676513Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3676628Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3676724Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3676819Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3676821Z 2025-12-04T13:44:26.3677054Z [rank3]:[W1204 13:42:31.130932745 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3680765Z [rank1]:[W1204 13:42:31.797625499 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3680945Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3681229Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3681391Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3681765Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3681967Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3682072Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3682169Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3682264Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3682266Z 2025-12-04T13:44:26.3682499Z [rank1]:[W1204 13:42:31.800113674 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3682670Z [rank2]:[W1204 13:42:32.115456406 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3682844Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3683103Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3683264Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3683631Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3683832Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3683937Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3684031Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3684142Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3684144Z 2025-12-04T13:44:26.3684378Z [rank2]:[W1204 13:42:32.116694489 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3684559Z [rank3]:[W1204 13:42:32.131064044 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3684748Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3685005Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3685180Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3685551Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3685752Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3685856Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3685951Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3686048Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3686050Z 2025-12-04T13:44:26.3686281Z [rank3]:[W1204 13:42:32.132244438 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3686451Z [rank1]:[W1204 13:42:32.800286753 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3686629Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3686884Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3687048Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3687416Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3687642Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3687748Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3687843Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3687939Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3687955Z 2025-12-04T13:44:26.3688188Z [rank1]:[W1204 13:42:32.802764568 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3688357Z [rank2]:[W1204 13:42:33.116842487 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3688556Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3688812Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3688998Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3689364Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3689567Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3689673Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3689767Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3689865Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3689867Z 2025-12-04T13:44:26.3690101Z [rank2]:[W1204 13:42:33.118057801 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3690271Z [rank3]:[W1204 13:42:33.132393347 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3690450Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3690706Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3690868Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3691236Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3691438Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3691543Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3691637Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3691735Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3691738Z 2025-12-04T13:44:26.3691996Z [rank3]:[W1204 13:42:33.133589740 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3692167Z [rank1]:[W1204 13:42:33.802921137 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3692340Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3692617Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3692780Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3693156Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3693360Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3693464Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3693559Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3693654Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3693657Z 2025-12-04T13:44:26.3693893Z [rank1]:[W1204 13:42:33.805460921 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3694065Z [rank2]:[W1204 13:42:34.118201250 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3694238Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3694494Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3694656Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3695025Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3695226Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3695331Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3695427Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3695524Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3695526Z 2025-12-04T13:44:26.3695762Z [rank2]:[W1204 13:42:34.119599199 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3695942Z [rank3]:[W1204 13:42:34.133704870 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3696119Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3696386Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3696558Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3696923Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3697135Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3697238Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3697334Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3697431Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3697432Z 2025-12-04T13:44:26.3697700Z [rank3]:[W1204 13:42:34.134936723 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3697873Z [rank1]:[W1204 13:42:34.805644479 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3698047Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3698303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3698469Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3698834Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3699036Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3699139Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3699235Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3699332Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3699333Z 2025-12-04T13:44:26.3699566Z [rank1]:[W1204 13:42:34.807944059 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3699750Z [rank2]:[W1204 13:42:35.119815766 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3699924Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3700179Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3700368Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3700738Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3700952Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3701055Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3701150Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3701247Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3701251Z 2025-12-04T13:44:26.3701483Z [rank2]:[W1204 13:42:35.121065139 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3701652Z [rank3]:[W1204 13:42:35.135062282 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3701827Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3702081Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3702245Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3702610Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3702815Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3702920Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3703014Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3703109Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3703112Z 2025-12-04T13:44:26.3703344Z [rank3]:[W1204 13:42:35.136233637 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3703513Z [rank1]:[W1204 13:42:35.808133067 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3703700Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3703955Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3704120Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3704516Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3704728Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3704830Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3704927Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3705023Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3705025Z 2025-12-04T13:44:26.3705261Z [rank1]:[W1204 13:42:35.810447196 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3705432Z [rank2]:[W1204 13:42:36.121210308 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3705605Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3705859Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3706021Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3706387Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3706589Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3706695Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3706790Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3706887Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3706889Z 2025-12-04T13:44:26.3707126Z [rank2]:[W1204 13:42:36.122441721 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3707295Z [rank3]:[W1204 13:42:36.136379206 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3707507Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3707775Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3707938Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3708317Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3708530Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3708647Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3708742Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3708838Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3708840Z 2025-12-04T13:44:26.3709072Z [rank3]:[W1204 13:42:36.138004570 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3709245Z [rank1]:[W1204 13:42:36.810634084 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3709419Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3709677Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3709838Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3710204Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3710407Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3710511Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3710607Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3710702Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3710703Z 2025-12-04T13:44:26.3710936Z [rank1]:[W1204 13:42:36.812739738 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3711107Z [rank2]:[W1204 13:42:37.122594110 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3711281Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3711550Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3711713Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3712093Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3712305Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3712410Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3712517Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3712613Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3712615Z 2025-12-04T13:44:26.3712848Z [rank2]:[W1204 13:42:37.123817533 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3713016Z [rank3]:[W1204 13:42:37.138171839 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3713192Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3713447Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3713612Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3713980Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3714182Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3714285Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3714380Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3714477Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3714479Z 2025-12-04T13:44:26.3714710Z [rank3]:[W1204 13:42:37.140252603 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3714879Z [rank1]:[W1204 13:42:37.812919156 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3715055Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3715308Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3715480Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3715853Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3716079Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3716182Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3716277Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3716382Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3716385Z 2025-12-04T13:44:26.3716619Z [rank1]:[W1204 13:42:37.815124578 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3716790Z [rank2]:[W1204 13:42:38.123962482 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3716965Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3717220Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3717383Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3717788Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3717991Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3718096Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3718191Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3718288Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3718289Z 2025-12-04T13:44:26.3718523Z [rank2]:[W1204 13:42:38.125324082 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3718692Z [rank3]:[W1204 13:42:38.140397762 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3718865Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3719122Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3719285Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3719673Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3719874Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3720013Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3720108Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3720206Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3720224Z 2025-12-04T13:44:26.3720457Z [rank3]:[W1204 13:42:38.142475907 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3720628Z [rank1]:[W1204 13:42:38.815309336 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3720801Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3721057Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3721221Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3721588Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3721791Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3721896Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3721992Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3722089Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3722091Z 2025-12-04T13:44:26.3722330Z [rank1]:[W1204 13:42:38.817697624 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3722503Z [rank2]:[W1204 13:42:39.125467382 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3722677Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3722933Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3723096Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3723475Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3723677Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3723781Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3723897Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3723994Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3723996Z 2025-12-04T13:44:26.3724229Z [rank2]:[W1204 13:42:39.127954087 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3724412Z [rank3]:[W1204 13:42:39.142625766 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3724588Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3724845Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3725009Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3725378Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3725580Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3725684Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3725779Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3725876Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3725877Z 2025-12-04T13:44:26.3726109Z [rank3]:[W1204 13:42:39.144269330 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3726280Z [rank1]:[W1204 13:42:39.817885712 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3726454Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3726710Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3726877Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3727251Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3727455Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3727583Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3727681Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3727802Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3727806Z 2025-12-04T13:44:26.3728038Z [rank1]:[W1204 13:42:39.819898298 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3728222Z [rank2]:[W1204 13:42:40.128100876 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3728395Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3728649Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3728814Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3729188Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3729391Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3729494Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3729589Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3729685Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3729688Z 2025-12-04T13:44:26.3729922Z [rank2]:[W1204 13:42:40.129601613 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3730091Z [rank3]:[W1204 13:42:40.144396799 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3730269Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3730523Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3730686Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3731057Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3731272Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3731377Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3731471Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3731566Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3731578Z 2025-12-04T13:44:26.3731820Z [rank3]:[W1204 13:42:40.145775149 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3731991Z [rank1]:[W1204 13:42:40.820071917 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3732175Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3732428Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3732589Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3732955Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3733159Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3733263Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3733359Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3733453Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3733457Z 2025-12-04T13:44:26.3733691Z [rank1]:[W1204 13:42:40.821434887 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3733861Z [rank2]:[W1204 13:42:41.129791772 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3734035Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3734289Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3734451Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3734819Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3735022Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3735141Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3735237Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3735334Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3735336Z 2025-12-04T13:44:26.3735579Z [rank2]:[W1204 13:42:41.131169972 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3735758Z [rank3]:[W1204 13:42:41.145956908 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3735933Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3736199Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3736362Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3736730Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3736932Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3737037Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3737132Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3737228Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3737230Z 2025-12-04T13:44:26.3737463Z [rank3]:[W1204 13:42:41.147818937 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3737705Z [rank1]:[W1204 13:42:41.821610936 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3737881Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3738139Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3738303Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3738675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3738879Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3738983Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3739090Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3739186Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3739189Z 2025-12-04T13:44:26.3739420Z [rank1]:[W1204 13:42:41.823502924 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3739618Z [rank2]:[W1204 13:42:42.131338240 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3739794Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3740052Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3740228Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3740595Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3740798Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3740902Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3740998Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3741094Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3741096Z 2025-12-04T13:44:26.3741331Z [rank2]:[W1204 13:42:42.133062463 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3741502Z [rank3]:[W1204 13:42:42.147967336 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3741677Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3741932Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3742097Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3742464Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3742667Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3742771Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3742866Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3742973Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3742975Z 2025-12-04T13:44:26.3743207Z [rank3]:[W1204 13:42:42.150200227 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3743378Z [rank1]:[W1204 13:42:42.823671923 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3743574Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3743828Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3744003Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3744372Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3744575Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3744679Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3744775Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3744872Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3744874Z 2025-12-04T13:44:26.3745106Z [rank1]:[W1204 13:42:42.826047531 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3745276Z [rank2]:[W1204 13:42:43.133226052 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3745451Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3745706Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3745869Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3746237Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3746442Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3746547Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3746643Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3746739Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3746742Z 2025-12-04T13:44:26.3746987Z [rank2]:[W1204 13:42:43.135134800 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3747156Z [rank3]:[W1204 13:42:43.150348377 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3747331Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3747649Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3747813Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3748204Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3748404Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3748511Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3748605Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3748702Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3748705Z 2025-12-04T13:44:26.3748938Z [rank3]:[W1204 13:42:43.152500589 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3749109Z [rank1]:[W1204 13:42:43.826213540 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3749284Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3749540Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3749704Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3750072Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3750274Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3750380Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3750476Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3750571Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3750573Z 2025-12-04T13:44:26.3750824Z [rank1]:[W1204 13:42:43.828232596 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3750996Z [rank2]:[W1204 13:42:44.135288139 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3751170Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3751434Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3751612Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3751989Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3752191Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3752295Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3752393Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3752490Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3752491Z 2025-12-04T13:44:26.3752727Z [rank2]:[W1204 13:42:44.137341654 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3752899Z [rank3]:[W1204 13:42:44.152648039 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3753075Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3753331Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3753494Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3753860Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3754061Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3754165Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3754259Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3754356Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3754358Z 2025-12-04T13:44:26.3754590Z [rank3]:[W1204 13:42:44.154944409 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3754771Z [rank1]:[W1204 13:42:44.828404665 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3754947Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3755207Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3755392Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3755760Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3755974Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3756076Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3756171Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3756269Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3756271Z 2025-12-04T13:44:26.3756502Z [rank1]:[W1204 13:42:44.830636696 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3756673Z [rank2]:[W1204 13:42:45.137489884 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3756847Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3757102Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3757267Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3757669Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3757873Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3757975Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3758072Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3758168Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3758170Z 2025-12-04T13:44:26.3758406Z [rank2]:[W1204 13:42:45.139324454 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3758575Z [rank3]:[W1204 13:42:45.155089538 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3758763Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3759019Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3759180Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3759580Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3759794Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3759898Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3759992Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3760088Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3760091Z 2025-12-04T13:44:26.3760326Z [rank3]:[W1204 13:42:45.157316140 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3760497Z [rank1]:[W1204 13:42:45.830842694 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3760675Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3760929Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3761090Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3761459Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3761662Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3761768Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3761862Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3761958Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3761960Z 2025-12-04T13:44:26.3762192Z [rank1]:[W1204 13:42:45.832654415 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3762364Z [rank2]:[W1204 13:42:46.139492314 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3762539Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3762805Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3762967Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3763347Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3763567Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3763682Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3763779Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3763876Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3763878Z 2025-12-04T13:44:26.3764118Z [rank2]:[W1204 13:42:46.141412662 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3764289Z [rank3]:[W1204 13:42:46.157479480 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3764464Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3764721Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3764884Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3765252Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3765453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3765558Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3765653Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3765750Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3765752Z 2025-12-04T13:44:26.3765987Z [rank3]:[W1204 13:42:46.159597223 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3766160Z [rank1]:[W1204 13:42:46.832831325 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3766335Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3766600Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3766763Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3767137Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3767348Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3767466Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3767604Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3767700Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3767702Z 2025-12-04T13:44:26.3767935Z [rank1]:[W1204 13:42:46.835277941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3768107Z [rank2]:[W1204 13:42:47.141579231 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3768281Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3768541Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3768705Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3769074Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3769278Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3769381Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3769478Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3769573Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3769575Z 2025-12-04T13:44:26.3769808Z [rank2]:[W1204 13:42:47.143213185 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3769978Z [rank3]:[W1204 13:42:47.159800142 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3770154Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3770412Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3770589Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3770958Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3771184Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3771289Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3771398Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3771494Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3771496Z 2025-12-04T13:44:26.3771731Z [rank3]:[W1204 13:42:47.161247440 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3771900Z [rank1]:[W1204 13:42:47.835470590 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3772075Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3772329Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3772492Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3772860Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3773064Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3773169Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3773263Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3773360Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3773361Z 2025-12-04T13:44:26.3773594Z [rank1]:[W1204 13:42:47.837734380 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3773765Z [rank2]:[W1204 13:42:48.143370465 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3773940Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3774196Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3774358Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3774737Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3774941Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3775064Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3775160Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3775256Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3775268Z 2025-12-04T13:44:26.3775502Z [rank2]:[W1204 13:42:48.145355711 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3775671Z [rank3]:[W1204 13:42:48.161404990 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3775846Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3776104Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3776265Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3776633Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3776834Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3776941Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3777036Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3777132Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3777136Z 2025-12-04T13:44:26.3777370Z [rank3]:[W1204 13:42:48.163548233 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3777586Z [rank1]:[W1204 13:42:48.837921649 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3777761Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3778017Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3778180Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3778558Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3778760Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3778864Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3778987Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3779084Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3779086Z 2025-12-04T13:44:26.3779335Z [rank1]:[W1204 13:42:48.840225069 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3779506Z [rank2]:[W1204 13:42:49.145509961 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3779679Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3779935Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3780097Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3780465Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3780668Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3780771Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3780868Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3780964Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3780965Z 2025-12-04T13:44:26.3781202Z [rank2]:[W1204 13:42:49.147822170 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3781376Z [rank3]:[W1204 13:42:49.163671593 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3781552Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3781809Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3781972Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3782350Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3782551Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3782655Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3782749Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3782866Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3782868Z 2025-12-04T13:44:26.3783102Z [rank3]:[W1204 13:42:49.165755887 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3783284Z [rank1]:[W1204 13:42:49.840399188 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3783460Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3783717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3783882Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3784248Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3784452Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3784556Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3784651Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3784751Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3784752Z 2025-12-04T13:44:26.3784985Z [rank1]:[W1204 13:42:49.842605890 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3785029Z Autotune Choices Stats: 2025-12-04T13:44:26.3785498Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 0.017799999564886093, "best_triton_pos": 1, "best_triton_time": 0.07227899879217148, "best_triton_kernel": "triton_mm_90", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4"} 2025-12-04T13:44:26.3785548Z AUTOTUNE mm(2048x1024, 1024x1024) 2025-12-04T13:44:26.3785591Z strides: [s79, 1], [s52, 1] 2025-12-04T13:44:26.3785644Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T13:44:26.3785682Z mm 0.0178 ms 100.0% 2025-12-04T13:44:26.3785923Z triton_mm_90 0.0723 ms 24.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3786169Z triton_mm_91 0.0798 ms 22.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3786403Z triton_mm_93 0.0828 ms 21.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3786643Z triton_mm_92 0.0856 ms 20.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T13:44:26.3786881Z triton_mm_82 0.0944 ms 18.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3787134Z triton_mm_103 0.0963 ms 18.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=64, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.3787363Z triton_mm_78 0.1037 ms 17.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3787635Z triton_mm_94 0.1090 ms 16.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 
2025-12-04T13:44:26.3787871Z triton_mm_87 0.1096 ms 16.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.3788006Z SingleProcess AUTOTUNE benchmarking takes 20.4406 seconds and 0.9434 seconds precompiling for 37 choices 2025-12-04T13:44:26.3788180Z [rank2]:[W1204 13:42:50.147944161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3788355Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3788614Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3788781Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3789153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3789357Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3789462Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3789559Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3789656Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3789658Z 2025-12-04T13:44:26.3789896Z [rank2]:[W1204 13:42:50.150399417 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG 
GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3790083Z [rank3]:[W1204 13:42:50.165909977 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3790258Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3790514Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3790703Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3791073Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3791287Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3791392Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3791489Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3791587Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3791588Z 2025-12-04T13:44:26.3791823Z [rank3]:[W1204 13:42:50.168101839 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3791993Z [rank1]:[W1204 13:42:50.842799189 TCPStore.cpp:106] [c10d] sendBytes failed on 
SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3792169Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3792426Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3792591Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3792957Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3793159Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3793263Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3793357Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3793453Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3793456Z 2025-12-04T13:44:26.3793688Z [rank1]:[W1204 13:42:50.845138407 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3793859Z [rank2]:[W1204 13:42:51.150530377 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3794050Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3794310Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3794475Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3794862Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3795074Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3795178Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3795274Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3795369Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3795372Z 2025-12-04T13:44:26.3795606Z [rank2]:[W1204 13:42:51.152557863 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3795776Z [rank3]:[W1204 13:42:51.168254469 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3795951Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3796208Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3796373Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3796746Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3796947Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3797054Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3797150Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3797245Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3797247Z 2025-12-04T13:44:26.3797517Z [rank3]:[W1204 13:42:51.170311244 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3797686Z [rank1]:[W1204 13:42:51.845322397 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3797862Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3798130Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3798294Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3798693Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3798895Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3799013Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3799108Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3799204Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3799206Z 2025-12-04T13:44:26.3799438Z [rank1]:[W1204 13:42:51.847570167 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3799611Z [rank2]:[W1204 13:42:52.152661644 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3799784Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3800042Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3800206Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3800572Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3800776Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3800882Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3800980Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3801076Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3801078Z 2025-12-04T13:44:26.3801311Z [rank2]:[W1204 13:42:52.154549173 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3801483Z [rank3]:[W1204 13:42:52.170466434 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3801656Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3801923Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3802084Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3802460Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3802672Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3802786Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3802883Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3802979Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3802981Z 2025-12-04T13:44:26.3803217Z [rank3]:[W1204 13:42:52.172786773 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3803259Z Autotune Choices Stats: 2025-12-04T13:44:26.3803723Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 0.017839999869465828, "best_triton_pos": 1, "best_triton_time": 0.07299800217151642, "best_triton_kernel": "triton_mm_18", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4"} 2025-12-04T13:44:26.3803767Z AUTOTUNE mm(1024x1024, 1024x2048) 2025-12-04T13:44:26.3803809Z strides: [s52, 1], [s20, 1] 2025-12-04T13:44:26.3803858Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T13:44:26.3803896Z mm 0.0178 ms 100.0% 2025-12-04T13:44:26.3804132Z triton_mm_18 0.0730 ms 24.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3804367Z triton_mm_19 0.0836 ms 21.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3804599Z triton_mm_21 0.0848 ms 21.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3804832Z triton_mm_20 0.0906 ms 19.7% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T13:44:26.3805067Z triton_mm_31 0.0963 ms 18.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=64, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.3805299Z triton_mm_10 0.0986 ms 18.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3805529Z triton_mm_6 0.1061 ms 16.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3805772Z triton_mm_22 0.1064 ms 16.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T13:44:26.3806003Z triton_mm_11 0.1103 ms 16.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T13:44:26.3806155Z SingleProcess AUTOTUNE benchmarking takes 22.5363 seconds and 0.1481 seconds precompiling for 37 choices 2025-12-04T13:44:26.3806327Z [rank1]:[W1204 13:42:52.847742087 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3806518Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3806774Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3806938Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3807310Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3807532Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3807640Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3807736Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3807833Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3807835Z 2025-12-04T13:44:26.3808072Z [rank1]:[W1204 13:42:52.850103715 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3808245Z [rank2]:[W1204 13:42:53.154702583 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3808421Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3808677Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3808841Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3809208Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3809414Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3809517Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3809628Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3809725Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3809727Z 2025-12-04T13:44:26.3809959Z [rank2]:[W1204 13:42:53.156641490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3810155Z [rank3]:[W1204 13:42:53.173184368 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3810329Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3810601Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3810762Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3811131Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3811334Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3811438Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3811537Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3811633Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3811635Z 2025-12-04T13:44:26.3811870Z [rank3]:[W1204 13:42:53.175321621 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3812042Z [rank1]:[W1204 13:42:53.850231186 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3812216Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3812471Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3812635Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3813001Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3813204Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3813308Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3813404Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3813509Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3813512Z 2025-12-04T13:44:26.3813748Z [rank1]:[W1204 13:42:53.852648823 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3813920Z [rank2]:[W1204 13:42:54.156808650 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3814117Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3814372Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3814548Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3814915Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3815119Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3815222Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3815318Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3815415Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3815418Z 2025-12-04T13:44:26.3815650Z [rank2]:[W1204 13:42:54.158955533 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3815821Z [rank3]:[W1204 13:42:54.175469981 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3815998Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3816253Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3816418Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3816784Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3816986Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3817089Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3817184Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3817279Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3817281Z 2025-12-04T13:44:26.3817550Z [rank3]:[W1204 13:42:54.177727002 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3817720Z [rank1]:[W1204 13:42:54.852838772 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3817906Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3818178Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3818353Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3818720Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3818920Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3819026Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3819120Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3819218Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3819220Z 2025-12-04T13:44:26.3819452Z [rank1]:[W1204 13:42:54.855213630 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3819622Z [rank2]:[W1204 13:42:55.159098753 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3819796Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3820050Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3820216Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3820585Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3820787Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3820893Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3820988Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3821084Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3821087Z 2025-12-04T13:44:26.3821328Z [rank2]:[W1204 13:42:55.161244356 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3821499Z [rank3]:[W1204 13:42:55.177878992 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3821673Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3821957Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3822119Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3822502Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3822703Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3822805Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3822904Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3822999Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3823001Z 2025-12-04T13:44:26.3823234Z [rank3]:[W1204 13:42:55.180258520 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3823405Z [rank1]:[W1204 13:42:55.855394250 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3823580Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3823835Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3823997Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3824363Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3824565Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3824669Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3824767Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3824863Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3824865Z 2025-12-04T13:44:26.3825098Z [rank1]:[W1204 13:42:55.857342397 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3825277Z [rank2]:[W1204 13:42:56.161377487 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3825452Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3825707Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3825889Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3826255Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3826468Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3826572Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3826668Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3826767Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3826769Z 2025-12-04T13:44:26.3827003Z [rank2]:[W1204 13:42:56.163810654 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3827175Z [rank3]:[W1204 13:42:56.180386431 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3827350Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3827646Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3827809Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3828176Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3828381Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3828484Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3828579Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3828675Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3828677Z 2025-12-04T13:44:26.3828916Z [rank3]:[W1204 13:42:56.182537393 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3829093Z [rank1]:[W1204 13:42:56.857521967 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3829282Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3829538Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3829714Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3830093Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3830306Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3830411Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3830505Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3830600Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3830603Z 2025-12-04T13:44:26.3830836Z [rank1]:[W1204 13:42:56.858901197 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3831006Z [rank2]:[W1204 13:42:57.163990393 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3831183Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3831439Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3831601Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3831970Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3832177Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3832281Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3832375Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3832471Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3832473Z 2025-12-04T13:44:26.3832707Z [rank2]:[W1204 13:42:57.166388111 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3832879Z [rank3]:[W1204 13:42:57.182685294 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3833053Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3833322Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3833484Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3833869Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3834071Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3834186Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3834282Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3834377Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3834378Z 2025-12-04T13:44:26.3834612Z [rank3]:[W1204 13:42:57.184843227 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3834781Z [rank1]:[W1204 13:42:57.859091466 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3834956Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3835213Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3835376Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3835747Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3835949Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3836055Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3836150Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3836246Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3836247Z 2025-12-04T13:44:26.3836481Z [rank1]:[W1204 13:42:57.860873157 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3836651Z [rank2]:[W1204 13:42:58.167733265 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3836826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3837094Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3837258Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3837691Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3837907Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3838025Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3838120Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3838217Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3838218Z 2025-12-04T13:44:26.3838450Z [rank2]:[W1204 13:42:58.169212203 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3838623Z [rank3]:[W1204 13:42:58.184952738 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3838797Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3839054Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3839216Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3839583Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3839785Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3839888Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3839985Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3840081Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3840082Z 2025-12-04T13:44:26.3840315Z [rank3]:[W1204 13:42:58.187189759 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3840484Z [rank1]:[W1204 13:42:58.861021248 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3840660Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3840918Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3841094Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3841463Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3841684Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3841789Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3841893Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3841990Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3841992Z 2025-12-04T13:44:26.3842229Z [rank1]:[W1204 13:42:58.863493134 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3842397Z [rank2]:[W1204 13:42:59.169301855 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3842573Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3842827Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3842993Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3843360Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3843565Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3843670Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3843765Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3843863Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3843867Z 2025-12-04T13:44:26.3844099Z [rank2]:[W1204 13:42:59.170450319 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3844270Z [rank3]:[W1204 13:42:59.187295271 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3844445Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3844700Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3844864Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3845240Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3845453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3845566Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3845662Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3845767Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3845769Z 2025-12-04T13:44:26.3846002Z [rank3]:[W1204 13:42:59.189284777 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3846171Z [rank1]:[W1204 13:42:59.863658774 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3846347Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3846609Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3846774Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3847143Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3847344Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3847451Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3847590Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3847689Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3847691Z 2025-12-04T13:44:26.3847925Z [rank1]:[W1204 13:42:59.865699519 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3848097Z [rank2]:[W1204 13:43:00.170627070 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3848271Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3848528Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3848693Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3849072Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3849275Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3849405Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3849500Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3849598Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3849613Z 2025-12-04T13:44:26.3849846Z [rank2]:[W1204 13:43:00.172732423 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3850017Z [rank3]:[W1204 13:43:00.189433718 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3850191Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3850447Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3850611Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3850985Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3851187Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3851291Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3851388Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3851483Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3851485Z 2025-12-04T13:44:26.3851719Z [rank3]:[W1204 13:43:00.191402234 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3851892Z [rank1]:[W1204 13:43:00.865885739 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3852066Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3852324Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3852486Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3852869Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3853073Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3853179Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3853284Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3853389Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3853391Z 2025-12-04T13:44:26.3853624Z [rank1]:[W1204 13:43:00.868258667 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3853806Z [rank2]:[W1204 13:43:01.172978112 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3853983Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3854237Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3854403Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3854769Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3854975Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3855080Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3855177Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3855276Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3855277Z 2025-12-04T13:44:26.3855510Z [rank2]:[W1204 13:43:01.174444060 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3855699Z [rank3]:[W1204 13:43:01.191555855 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3855873Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3856129Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3856293Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3856660Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3856872Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3856975Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3857071Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3857175Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3857186Z 2025-12-04T13:44:26.3857424Z [rank3]:[W1204 13:43:01.193346806 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3857638Z [rank1]:[W1204 13:43:01.868451547 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3857814Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3858069Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3858232Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3858598Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3858800Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3858904Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3858999Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3859095Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3859098Z 2025-12-04T13:44:26.3859331Z [rank1]:[W1204 13:43:01.870228888 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3859503Z [rank2]:[W1204 13:43:02.174594251 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3859683Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3859940Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3860103Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3860471Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3860689Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3860795Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3860891Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3860990Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3860991Z 2025-12-04T13:44:26.3861250Z [rank2]:[W1204 13:43:02.175795364 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3861420Z [rank3]:[W1204 13:43:02.193452588 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3861606Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3861867Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3862032Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3862400Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3862601Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3862707Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3862803Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3862897Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3862899Z 2025-12-04T13:44:26.3863133Z [rank3]:[W1204 13:43:02.195376565 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3863305Z [rank1]:[W1204 13:43:02.870428198 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3863478Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3863736Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3863899Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3864272Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3864473Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3864588Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3864686Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3864780Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3864782Z 2025-12-04T13:44:26.3865015Z [rank1]:[W1204 13:43:02.873030301 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3865203Z [rank2]:[W1204 13:43:03.175944415 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3865380Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3865649Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3865813Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3866185Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3866389Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3866497Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3866593Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3866691Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3866693Z 2025-12-04T13:44:26.3866925Z [rank2]:[W1204 13:43:03.177211217 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3867097Z [rank3]:[W1204 13:43:03.195541196 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3867270Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3867574Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3867736Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3868102Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3868308Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3868411Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3868510Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3868617Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3868619Z 2025-12-04T13:44:26.3868855Z [rank3]:[W1204 13:43:03.197786917 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3869038Z [rank1]:[W1204 13:43:03.873185862 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3869224Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3869480Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3869658Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3870030Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3870233Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3870338Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3870434Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3870531Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3870533Z 2025-12-04T13:44:26.3870769Z [rank1]:[W1204 13:43:03.874448054 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3870938Z [rank2]:[W1204 13:43:04.177378198 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3871115Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3871371Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3871535Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3871903Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3872105Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3872209Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3872304Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3872401Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3872403Z 2025-12-04T13:44:26.3872649Z [rank2]:[W1204 13:43:04.179211938 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3872821Z [rank3]:[W1204 13:43:04.197955537 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3873016Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3873272Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3873446Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3873812Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3874014Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3874118Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3874215Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3874311Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3874314Z 2025-12-04T13:44:26.3874548Z [rank3]:[W1204 13:43:04.200270696 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3874719Z [rank1]:[W1204 13:43:04.874613035 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3874895Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3875152Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3875314Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3875682Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3875882Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3875988Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3876083Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3876178Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3876181Z 2025-12-04T13:44:26.3876424Z [rank1]:[W1204 13:43:04.876703479 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3876594Z [rank2]:[W1204 13:43:05.179365769 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3876769Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3877046Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3877214Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3877639Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3877842Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3877949Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3878044Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3878141Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3878143Z 2025-12-04T13:44:26.3878377Z [rank2]:[W1204 13:43:05.181268147 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3878547Z [rank3]:[W1204 13:43:05.200419467 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3878721Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3878978Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3879143Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3879514Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3879716Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3879819Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3879916Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3880011Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3880013Z 2025-12-04T13:44:26.3880247Z [rank3]:[W1204 13:43:05.202315506 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3880435Z [rank1]:[W1204 13:43:05.876875279 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3880610Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3880882Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3881064Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3881435Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3881654Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3881758Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3881856Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3881952Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3881954Z 2025-12-04T13:44:26.3882187Z [rank1]:[W1204 13:43:05.878721539 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3882360Z [rank2]:[W1204 13:43:06.181433448 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3882536Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3882791Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3882956Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3883323Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3883525Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3883630Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3883726Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3883825Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3883827Z 2025-12-04T13:44:26.3884060Z [rank2]:[W1204 13:43:06.183443114 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3884242Z [rank3]:[W1204 13:43:06.202473157 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3884416Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3884674Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3884859Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3885227Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3885445Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3885548Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3885643Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3885738Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3885743Z 2025-12-04T13:44:26.3885977Z [rank3]:[W1204 13:43:06.204911823 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3886149Z [rank1]:[W1204 13:43:06.878882640 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3886324Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3886578Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3886739Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3887107Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3887310Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3887413Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3887553Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3887649Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3887652Z 2025-12-04T13:44:26.3887888Z [rank1]:[W1204 13:43:06.880148552 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3888060Z [rank2]:[W1204 13:43:07.183591405 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3888253Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3888509Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3888673Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3889066Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3889280Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3889385Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3889480Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3889579Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3889581Z 2025-12-04T13:44:26.3889815Z [rank2]:[W1204 13:43:07.185014544 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3889986Z [rank3]:[W1204 13:43:07.205066744 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3890163Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3890421Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3890584Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3890952Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3891155Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3891260Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3891356Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3891451Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3891455Z 2025-12-04T13:44:26.3891688Z [rank3]:[W1204 13:43:07.206993392 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3891860Z [rank1]:[W1204 13:43:07.880308523 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3892034Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3892303Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3892470Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3892849Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3893062Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3893174Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3893270Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3893365Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3893367Z 2025-12-04T13:44:26.3893601Z [rank1]:[W1204 13:43:07.881635594 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3893772Z [rank2]:[W1204 13:43:08.185164165 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3893948Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3894204Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3894367Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3894741Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3894944Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3895049Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3895144Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3895243Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3895245Z 2025-12-04T13:44:26.3895477Z [rank2]:[W1204 13:43:08.186390348 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3895650Z [rank3]:[W1204 13:43:08.207103384 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3895826Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3896091Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3896256Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3896647Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3896868Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3896975Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3897083Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3897179Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3897183Z 2025-12-04T13:44:26.3897415Z [rank3]:[W1204 13:43:08.209274577 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3897615Z [rank1]:[W1204 13:43:08.881800015 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3897790Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3898047Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3898210Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3898578Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3898781Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3898887Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3898983Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3899079Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3899081Z 2025-12-04T13:44:26.3899317Z [rank1]:[W1204 13:43:08.884064925 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3899486Z [rank2]:[W1204 13:43:09.186556649 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3899663Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3899920Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3900103Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3900473Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3900702Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3900807Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3900902Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3901010Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3901012Z 2025-12-04T13:44:26.3901247Z [rank2]:[W1204 13:43:09.188470737 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3901419Z [rank3]:[W1204 13:43:09.209443867 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3901600Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3901857Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3902021Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3902389Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3902597Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3902702Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3902797Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3902894Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3902896Z 2025-12-04T13:44:26.3903129Z [rank3]:[W1204 13:43:09.211522142 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3903301Z [rank1]:[W1204 13:43:09.884237126 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3903475Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3903733Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3903895Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3904273Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3904476Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3908829Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3908935Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3909031Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3909050Z 2025-12-04T13:44:26.3909293Z [rank1]:[W1204 13:43:09.886498976 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3909465Z [rank2]:[W1204 13:43:10.188569539 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3909643Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3909906Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3910072Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3910447Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3910649Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3910759Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3910856Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3910954Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3910956Z 2025-12-04T13:44:26.3911191Z [rank2]:[W1204 13:43:10.190344640 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3911362Z [rank3]:[W1204 13:43:10.211634994 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3911538Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3911794Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3911961Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3912346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3912547Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3912652Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3912772Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3912869Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3912872Z 2025-12-04T13:44:26.3913103Z [rank3]:[W1204 13:43:10.213550982 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3913289Z [rank1]:[W1204 13:43:10.886679807 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3913463Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3913720Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3913882Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3914255Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3914459Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3914563Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3914658Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3914756Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3914758Z 2025-12-04T13:44:26.3914991Z [rank1]:[W1204 13:43:10.889099844 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3915163Z [rank2]:[W1204 13:43:11.190491982 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3915337Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3915591Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3915756Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3916140Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3916344Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3916450Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3916544Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3916663Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3916665Z 2025-12-04T13:44:26.3916897Z [rank2]:[W1204 13:43:11.192240674 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3917083Z [rank3]:[W1204 13:43:11.213701573 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3917260Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3917558Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3917723Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3918089Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3918301Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3918405Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3918502Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3918598Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3918601Z 2025-12-04T13:44:26.3918834Z [rank3]:[W1204 13:43:11.215404926 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3919004Z [rank1]:[W1204 13:43:11.889271325 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3919181Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3919435Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3919597Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3919965Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3920180Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3920284Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3920380Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3920475Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3920490Z 2025-12-04T13:44:26.3920734Z [rank1]:[W1204 13:43:11.891320130 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3920905Z [rank2]:[W1204 13:43:12.192378865 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3921096Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3921351Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3921513Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3921886Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3922087Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3922192Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3922287Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3922383Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3922385Z 2025-12-04T13:44:26.3922620Z [rank2]:[W1204 13:43:12.194143887 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3922791Z [rank3]:[W1204 13:43:12.215550088 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3922966Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3923221Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3923384Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3923753Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3923955Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3924070Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3924165Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3924260Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3924263Z 2025-12-04T13:44:26.3924505Z [rank3]:[W1204 13:43:12.217471446 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3924687Z [rank1]:[W1204 13:43:12.891493161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3924861Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3925128Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3925290Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3925658Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3925861Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3925965Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3926061Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3926156Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3926158Z 2025-12-04T13:44:26.3926391Z [rank1]:[W1204 13:43:12.893832380 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3926562Z [rank2]:[W1204 13:43:13.194288308 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3926737Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3926996Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3927161Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3927586Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3927787Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3927894Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3928002Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3928099Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3928101Z 2025-12-04T13:44:26.3928332Z [rank2]:[W1204 13:43:13.196027850 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3928527Z [rank3]:[W1204 13:43:13.217827583 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3928702Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3928957Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3929142Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3929512Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3929716Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3929820Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3929916Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3930014Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3930016Z 2025-12-04T13:44:26.3930247Z [rank3]:[W1204 13:43:13.219904447 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3930418Z [rank1]:[W1204 13:43:13.894022601 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3930592Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3930847Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3931010Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3931378Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3931582Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3931685Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3931781Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3931885Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3931887Z 2025-12-04T13:44:26.3932122Z [rank1]:[W1204 13:43:13.896183913 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3932290Z [rank2]:[W1204 13:43:14.196184372 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3932487Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3932744Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3932917Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3933286Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3933491Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3933596Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3933691Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3933788Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3933789Z 2025-12-04T13:44:26.3934022Z [rank2]:[W1204 13:43:14.198279056 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3934192Z [rank3]:[W1204 13:43:14.220056399 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3934369Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3934624Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3934787Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3935151Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3935352Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3935457Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3935553Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3935649Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3935651Z 2025-12-04T13:44:26.3935894Z [rank3]:[W1204 13:43:14.222294850 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3936064Z [rank1]:[W1204 13:43:14.896307465 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3936237Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3936512Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3936674Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3937051Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3937253Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3937359Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3937454Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3937587Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3937590Z 2025-12-04T13:44:26.3937825Z [rank1]:[W1204 13:43:14.898684843 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3937995Z [rank2]:[W1204 13:43:15.198416638 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3938169Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3938426Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3938588Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3938958Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3939160Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3939265Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3939361Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3939457Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3939460Z 2025-12-04T13:44:26.3939709Z [rank2]:[W1204 13:43:15.199844916 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3939879Z [rank3]:[W1204 13:43:15.222443161 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3940054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3940321Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3940496Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3940861Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3941078Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3941182Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3941278Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3941374Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3941376Z 2025-12-04T13:44:26.3941608Z [rank3]:[W1204 13:43:15.224631843 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3941653Z Result from 0 is 21102592.0 2025-12-04T13:44:26.3941695Z Result2 from 0 is 5832704.0 2025-12-04T13:44:26.3941735Z Result from 3 is 21102592.0 2025-12-04T13:44:26.3941774Z Result2 from 3 is 5832704.0 2025-12-04T13:44:26.3941813Z Result from 1 is 21102592.0 2025-12-04T13:44:26.3941852Z Result2 from 1 is 5832704.0 2025-12-04T13:44:26.3941891Z Result from 2 is 21102592.0 2025-12-04T13:44:26.3941929Z Result2 from 2 is 5832704.0 2025-12-04T13:44:26.3942103Z [rank1]:[W1204 13:43:15.898865034 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3942280Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3942541Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3942706Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3943074Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3943276Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3943380Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3943477Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3943585Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3943588Z 2025-12-04T13:44:26.3943821Z [rank1]:[W1204 13:43:15.900853941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3943992Z [rank2]:[W1204 13:43:16.200029637 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3944193Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3944451Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3944626Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3944995Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3945197Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3945302Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3945397Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3945494Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3945497Z 2025-12-04T13:44:26.3945729Z [rank2]:[W1204 13:43:16.202208319 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3945898Z [rank3]:[W1204 13:43:16.224780755 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3946075Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3946331Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3946495Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3946864Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3947067Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3947173Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3947269Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3947366Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3947368Z 2025-12-04T13:44:26.3947659Z [rank3]:[W1204 13:43:16.226050837 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3947830Z [rank1]:[W1204 13:43:16.901030972 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3948005Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3948288Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3948463Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3948840Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3949044Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3949149Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3949244Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3949339Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3949343Z 2025-12-04T13:44:26.3949575Z [rank1]:[W1204 13:43:16.902977009 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3949746Z [rank2]:[W1204 13:43:17.202357201 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3949921Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3950180Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3950342Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3950718Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3950919Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3951027Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3951123Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3951220Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3951223Z 2025-12-04T13:44:26.3951464Z [rank2]:[W1204 13:43:17.204559343 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3951633Z [rank3]:[W1204 13:43:17.226176360 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3951808Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.3952082Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3952246Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3952624Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3952826Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3952931Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3953027Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3953123Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3953125Z 2025-12-04T13:44:26.3953361Z [rank3]:[W1204 13:43:17.227415722 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3953406Z PASSED [181.0533s] [ 36%] 2025-12-04T13:44:26.3953681Z distributed/test_dynamo_distributed.py::TestSingleProc::test_aot_autograd [rank1]:[W1204 13:43:17.903164490 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3953857Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3954113Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 
0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3954277Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3954645Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3954846Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3954952Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3955046Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3955143Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3955146Z 2025-12-04T13:44:26.3955388Z [rank1]:[W1204 13:43:17.905173296 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3955560Z [rank2]:[W1204 13:43:18.204740154 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3955734Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3955998Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3956169Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.3956546Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3956747Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3956850Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3956949Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3957044Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3957048Z 2025-12-04T13:44:26.3957279Z [rank2]:[W1204 13:43:18.206958066 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3957451Z [rank3]:[W1204 13:43:18.227572584 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3957680Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3957941Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3958103Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3958472Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.3958674Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3958778Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3958874Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3958969Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3958971Z 2025-12-04T13:44:26.3959204Z [rank3]:[W1204 13:43:18.229600940 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3959389Z [rank1]:[W1204 13:43:18.905291949 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3959566Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3959822Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3960018Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3960387Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3960599Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3960703Z frame #4: + 0xdc253 
(0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3960800Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3960898Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3960900Z 2025-12-04T13:44:26.3961134Z [rank1]:[W1204 13:43:18.907733345 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3961305Z [rank2]:[W1204 13:43:19.207088418 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3961479Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3961733Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3961900Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3962269Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3962474Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3962577Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3962673Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3962770Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3962772Z 2025-12-04T13:44:26.3963004Z [rank2]:[W1204 13:43:19.208953567 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3963174Z [rank3]:[W1204 13:43:19.229747952 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3963360Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3963616Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3963788Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3964167Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3964383Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3964487Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3964582Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3964676Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3964680Z 2025-12-04T13:44:26.3964912Z [rank3]:[W1204 13:43:19.231992003 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3965082Z [rank1]:[W1204 13:43:19.907914487 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3965258Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3965511Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3965674Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3966042Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3966244Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3966348Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3966444Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3966540Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3966541Z 2025-12-04T13:44:26.3966774Z [rank1]:[W1204 13:43:19.910208876 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3966984Z [rank0]:W1204 13:43:20.139000 67577 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T13:44:26.3967165Z [rank2]:[W1204 
13:43:20.209116519 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3967340Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3967620Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3967811Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3968178Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3968392Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3968497Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3968593Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3968690Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3968693Z 2025-12-04T13:44:26.3968929Z [rank2]:[W1204 13:43:20.211222183 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3969099Z [rank3]:[W1204 13:43:20.232129425 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3969275Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3969533Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3969698Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3970068Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3970271Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3970376Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3970472Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3970567Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3970570Z 2025-12-04T13:44:26.3970804Z [rank3]:[W1204 13:43:20.234321657 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3970976Z [rank1]:[W1204 13:43:20.910307100 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3971164Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3971420Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3971582Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3971971Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3972184Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3972288Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3972384Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3972479Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3972482Z 2025-12-04T13:44:26.3972716Z [rank1]:[W1204 13:43:20.912768936 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3972887Z [rank2]:[W1204 13:43:21.211354555 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3973065Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3973322Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3973484Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3973853Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3974056Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3974161Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3974258Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3974354Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3974355Z 2025-12-04T13:44:26.3974589Z [rank2]:[W1204 13:43:21.213276033 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3974760Z [rank3]:[W1204 13:43:21.234451359 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3974935Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3975211Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3975375Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3975753Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3975966Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3976084Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3976178Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3976274Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3976276Z 2025-12-04T13:44:26.3976511Z [rank3]:[W1204 13:43:21.237210809 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3976682Z [rank1]:[W1204 13:43:21.912908588 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3976857Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3977112Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3977275Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3977766Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3977972Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3978077Z frame #4: + 0xdc253 (0x7e1038e48253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3978173Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3978269Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3978272Z 2025-12-04T13:44:26.3978506Z [rank1]:[W1204 13:43:21.915127050 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3978679Z [rank2]:[W1204 13:43:22.213424136 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3978852Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3979126Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3979289Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3979675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3979891Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3979994Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3980102Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3980198Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.3980200Z 2025-12-04T13:44:26.3980433Z [rank2]:[W1204 13:43:22.215530899 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3980602Z [rank3]:[W1204 13:43:22.237356761 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3980778Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3981034Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3981197Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3981563Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3981767Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3981872Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3981968Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3982065Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3982067Z 2025-12-04T13:44:26.3982298Z [rank3]:[W1204 13:43:22.240170760 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.3982469Z [rank1]:[W1204 13:43:22.915254992 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3982646Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3982900Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3983072Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3983438Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3983659Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3983763Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3983859Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3983966Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3983969Z 2025-12-04T13:44:26.3984204Z [rank1]:[W1204 13:43:22.917657890 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3984374Z [rank2]:[W1204 13:43:23.215635173 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3984550Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3984806Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3984969Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3985337Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3985539Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3985643Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3985739Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3985835Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3985837Z 2025-12-04T13:44:26.3986071Z [rank2]:[W1204 13:43:23.218051460 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3986242Z [rank3]:[W1204 13:43:23.240314222 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3986418Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3986674Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3986837Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3987215Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3987416Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3987581Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3987676Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3987773Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3987793Z 2025-12-04T13:44:26.3988026Z [rank3]:[W1204 13:43:23.242597242 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3988197Z [rank1]:[W1204 13:43:23.917830821 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3988371Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3988628Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3988791Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.3989159Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3989362Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3989468Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3989564Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3989659Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3989664Z 2025-12-04T13:44:26.3989898Z [rank1]:[W1204 13:43:23.920085412 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3990067Z [rank2]:[W1204 13:43:24.218191132 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3990240Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3990499Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3990663Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3991044Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
2025-12-04T13:44:26.3991245Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3991348Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3991468Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3991563Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3991565Z 2025-12-04T13:44:26.3991796Z [rank2]:[W1204 13:43:24.220702267 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3991978Z [rank3]:[W1204 13:43:24.242722145 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3992153Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3992408Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3992572Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3992944Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3993146Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3993250Z frame #4: + 0xdc253 
(0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3993345Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3993441Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3993443Z 2025-12-04T13:44:26.3993675Z [rank3]:[W1204 13:43:24.244756480 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3993848Z [rank1]:[W1204 13:43:24.920254044 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3994023Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3994280Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3994444Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3994818Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3995022Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3995125Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3995219Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3995337Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3995339Z 2025-12-04T13:44:26.3995571Z [rank1]:[W1204 13:43:24.922704040 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3995752Z [rank2]:[W1204 13:43:25.220833730 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3995925Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3996180Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3996346Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3996713Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3996915Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3997019Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3997115Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3997211Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3997214Z 2025-12-04T13:44:26.3997448Z [rank2]:[W1204 13:43:25.222059633 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3997672Z [rank3]:[W1204 13:43:25.244898343 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.3997848Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.3998104Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.3998268Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3998637Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.3998849Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.3998954Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.3999048Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3999144Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.3999157Z 2025-12-04T13:44:26.3999410Z [rank3]:[W1204 13:43:25.246417280 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.3999582Z [rank1]:[W1204 13:43:25.922871272 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.3999775Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4000029Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4000192Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4000558Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4000761Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4000866Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4000963Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4001058Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4001060Z 2025-12-04T13:44:26.4001294Z [rank1]:[W1204 13:43:25.924639173 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4001464Z [rank2]:[W1204 13:43:26.222208625 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4001640Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4001896Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4002057Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4002424Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4002628Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4002741Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4002838Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4002934Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4002935Z 2025-12-04T13:44:26.4003180Z [rank2]:[W1204 13:43:26.224140803 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4003366Z [rank3]:[W1204 13:43:26.246539063 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4003540Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4003810Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4003973Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4004342Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4004544Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4004649Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4004744Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4004841Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4004843Z 2025-12-04T13:44:26.4005074Z [rank3]:[W1204 13:43:26.249177755 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4005245Z [rank1]:[W1204 13:43:26.924803906 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4005418Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4005673Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4005837Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4006209Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4006412Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4006517Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4006621Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4006718Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4006719Z 2025-12-04T13:44:26.4006950Z [rank1]:[W1204 13:43:26.926541328 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4007140Z [rank2]:[W1204 13:43:27.224226737 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4007315Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4007610Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4007786Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4008157Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4008359Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4008462Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4008558Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4008654Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4008655Z 2025-12-04T13:44:26.4008888Z [rank2]:[W1204 13:43:27.226320931 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4009058Z [rank3]:[W1204 13:43:27.249263499 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4009234Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4009489Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4009653Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4010019Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4010222Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4010327Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4010423Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4010534Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4010536Z 2025-12-04T13:44:26.4010770Z [rank3]:[W1204 13:43:27.251807473 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4010939Z [rank1]:[W1204 13:43:27.926678800 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4011140Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4011397Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4011569Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4011934Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4012139Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4012243Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4012337Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4012435Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4012437Z 2025-12-04T13:44:26.4012676Z [rank1]:[W1204 13:43:27.927955092 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4012845Z [rank2]:[W1204 13:43:28.226460274 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4013019Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4013275Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4013437Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4013805Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4014008Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4014112Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4014207Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4014302Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4014305Z 2025-12-04T13:44:26.4014549Z [rank2]:[W1204 13:43:28.228936710 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4014719Z [rank3]:[W1204 13:43:28.251936356 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4014895Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4015174Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4015336Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4015716Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4015917Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4016023Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4016120Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4016215Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4016218Z 2025-12-04T13:44:26.4016451Z [rank3]:[W1204 13:43:28.254172887 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4016621Z [rank1]:[W1204 13:43:28.928123485 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4016797Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4017057Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4017220Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4017623Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4017823Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4017928Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4018024Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4018119Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4018121Z 2025-12-04T13:44:26.4018367Z [rank1]:[W1204 13:43:28.929830657 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4018538Z [rank2]:[W1204 13:43:29.229091732 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4018714Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4018991Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4019168Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4019551Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4019753Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4019857Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4019954Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4020049Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4020051Z 2025-12-04T13:44:26.4020288Z [rank2]:[W1204 13:43:29.231288634 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4020458Z [rank3]:[W1204 13:43:29.254354859 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4020634Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4020889Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4021051Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4021424Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4021626Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4021731Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4021826Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4021924Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4021926Z 2025-12-04T13:44:26.4022160Z [rank3]:[W1204 13:43:29.256557391 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4022340Z [rank1]:[W1204 13:43:29.929985610 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4022516Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4022771Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4022954Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4023320Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4023534Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4023639Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4023733Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4023833Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4023834Z 2025-12-04T13:44:26.4024065Z [rank1]:[W1204 13:43:29.932492075 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4024237Z [rank2]:[W1204 13:43:30.231422587 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4024411Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4024666Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4024830Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4025196Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4025399Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4025503Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4025600Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4025695Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4025698Z 2025-12-04T13:44:26.4025932Z [rank2]:[W1204 13:43:30.233709697 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4026102Z [rank3]:[W1204 13:43:30.256695154 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4026291Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4026547Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4026708Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4027097Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4027310Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4027414Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4027547Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4027644Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4027647Z 2025-12-04T13:44:26.4027886Z [rank3]:[W1204 13:43:30.258956954 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4028055Z [rank1]:[W1204 13:43:30.932633708 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4028231Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4028484Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4028645Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4029014Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4029216Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4029321Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4029416Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4029511Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4029513Z 2025-12-04T13:44:26.4029747Z [rank1]:[W1204 13:43:30.934797190 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4029919Z [rank2]:[W1204 13:43:31.233828430 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4030096Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4030365Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4030529Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4030907Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4031120Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4031236Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4031332Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4031428Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4031430Z 2025-12-04T13:44:26.4031666Z [rank2]:[W1204 13:43:31.235804987 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4031836Z [rank3]:[W1204 13:43:31.259212704 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4032010Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4032270Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4032432Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4032800Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4033001Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4033106Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4033201Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4033297Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4033299Z 2025-12-04T13:44:26.4033533Z [rank3]:[W1204 13:43:31.261414616 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4033705Z [rank1]:[W1204 13:43:31.934952683 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4033879Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4034152Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4034317Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4034693Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4034903Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4035017Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4035112Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4035209Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4035212Z 2025-12-04T13:44:26.4035443Z [rank1]:[W1204 13:43:31.936428851 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4035616Z [rank2]:[W1204 13:43:32.235916671 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4035790Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4036048Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4036212Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4036584Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4036788Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4036891Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4036989Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4037085Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4037087Z 2025-12-04T13:44:26.4037320Z [rank2]:[W1204 13:43:32.237983945 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4037526Z [rank3]:[W1204 13:43:32.261552629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4037701Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4037961Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4038136Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4038503Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4038733Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4038839Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4038946Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4039042Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4039044Z 2025-12-04T13:44:26.4039277Z [rank3]:[W1204 13:43:32.263767181 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4039446Z [rank1]:[W1204 13:43:32.936577784 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4039622Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4039876Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4040040Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4040406Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4040610Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4040714Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4040808Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4040905Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4040907Z 2025-12-04T13:44:26.4041143Z [rank1]:[W1204 13:43:33.938908513 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4041314Z [rank2]:[W1204 13:43:33.238219266 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4041489Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4041745Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4041909Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4042285Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4042489Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4042613Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4042711Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4042806Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4042818Z 2025-12-04T13:44:26.4043055Z [rank2]:[W1204 13:43:33.240438158 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4043229Z [rank3]:[W1204 13:43:33.263920424 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4043402Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4043661Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4043823Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4044190Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4044391Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4044496Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4044591Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4044686Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4044689Z 2025-12-04T13:44:26.4044925Z [rank3]:[W1204 13:43:33.265868231 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4045093Z [rank1]:[W1204 13:43:34.939055906 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4045269Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4045525Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4045687Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4046063Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4046266Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4046369Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4046486Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4046583Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4046585Z 2025-12-04T13:44:26.4046834Z [rank1]:[W1204 13:43:34.941245528 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4047004Z [rank2]:[W1204 13:43:34.240592431 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4047178Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4047439Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4047640Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4048008Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4048212Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4048314Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4048412Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4048507Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4048509Z 2025-12-04T13:44:26.4048741Z [rank2]:[W1204 13:43:34.242857071 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4048912Z [rank3]:[W1204 13:43:34.266018434 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4049086Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4049341Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4049502Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4049893Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4050093Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4050199Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4050295Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4050414Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4050416Z 2025-12-04T13:44:26.4050649Z [rank3]:[W1204 13:43:34.268143837 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4050835Z [rank1]:[W1204 13:43:35.941437940 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4051009Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4051263Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4051427Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4051794Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4051997Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4052102Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4052196Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4052294Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4052296Z 2025-12-04T13:44:26.4052528Z [rank1]:[W1204 13:43:35.943268909 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4052700Z [rank2]:[W1204 13:43:35.243012964 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4052873Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4053128Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4053294Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4053660Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4053874Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4053978Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4054074Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4054171Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4054182Z 2025-12-04T13:44:26.4054428Z [rank2]:[W1204 13:43:35.244237237 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4054600Z [rank3]:[W1204 13:43:35.268303770 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4054787Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4055043Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4055203Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4055576Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4055778Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4055882Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4055977Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4056072Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4056076Z 2025-12-04T13:44:26.4056313Z [rank3]:[W1204 13:43:35.269586812 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4056483Z [rank1]:[W1204 13:43:36.943462422 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4056660Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4056913Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4057075Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4057443Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4057681Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4057796Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4057891Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4057987Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4057989Z 2025-12-04T13:44:26.4058234Z [rank1]:[W1204 13:43:36.945198454 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4058418Z [rank2]:[W1204 13:43:36.244419010 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4058606Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4058863Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4059026Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4059393Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4059594Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4059699Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4059796Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4059891Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4059893Z 2025-12-04T13:44:26.4060127Z [rank2]:[W1204 13:43:36.246270799 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4060298Z [rank3]:[W1204 13:43:36.269715346 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4060472Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4060734Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4060895Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4061261Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4061462Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4061568Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4061676Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4061771Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4061773Z 2025-12-04T13:44:26.4062007Z [rank3]:[W1204 13:43:36.271184523 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4062201Z [rank1]:[W1204 13:43:37.945409175 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4062375Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4062642Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4062805Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4063176Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4063378Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4063485Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4063581Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4063677Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4063679Z 2025-12-04T13:44:26.4063911Z [rank1]:[W1204 13:43:37.947234715 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4064084Z [rank2]:[W1204 13:43:37.246445632 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4064258Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4064513Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4064679Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4065049Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4065254Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4065358Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4065455Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4065560Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4065562Z 2025-12-04T13:44:26.4065795Z [rank2]:[W1204 13:43:37.248137005 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4065966Z [rank3]:[W1204 13:43:37.271341556 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4066161Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4066417Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4066589Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4066956Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4067158Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4067265Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4067361Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4067457Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4067459Z 2025-12-04T13:44:26.4067735Z [rank3]:[W1204 13:43:37.273492039 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4067904Z [rank1]:[W1204 13:43:38.947380949 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4068082Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4068335Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4068499Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4068866Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4069067Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4069172Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4069266Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4069365Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4069367Z 2025-12-04T13:44:26.4069614Z [rank1]:[W1204 13:43:38.949482543 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4069786Z [rank2]:[W1204 13:43:38.248284878 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4069972Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4070240Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4070417Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4070782Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4070984Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4071089Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4071185Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4071281Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4071285Z 2025-12-04T13:44:26.4071520Z [rank2]:[W1204 13:43:38.250388532 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4071691Z [rank3]:[W1204 13:43:38.273616443 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4071865Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4072122Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4072283Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4072651Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4072854Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4072958Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4073054Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4073149Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4073152Z 2025-12-04T13:44:26.4073395Z [rank3]:[W1204 13:43:38.275617189 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4073564Z [rank1]:[W1204 13:43:39.949608197 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4073739Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4074014Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4074177Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4074556Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4074757Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4074862Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4074957Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4075052Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4075054Z 2025-12-04T13:44:26.4075287Z [rank1]:[W1204 13:43:39.951616893 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4075459Z [rank2]:[W1204 13:43:39.250553295 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4075634Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4075893Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4076061Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4076426Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4076629Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4076732Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4076830Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4076925Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4076929Z 2025-12-04T13:44:26.4077160Z [rank2]:[W1204 13:43:39.252690048 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4077340Z [rank3]:[W1204 13:43:39.275726633 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4077548Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4077806Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4077998Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4078371Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4078586Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4078690Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4078785Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4078882Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4078884Z 2025-12-04T13:44:26.4079117Z [rank3]:[W1204 13:43:39.277844427 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4079287Z [rank1]:[W1204 13:43:40.951797245 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4079461Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4079715Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4079878Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4080248Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4080452Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4080556Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4080651Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4080749Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4080753Z 2025-12-04T13:44:26.4080986Z [rank1]:[W1204 13:43:40.953733953 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4081157Z [rank2]:[W1204 13:43:40.252831531 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4081344Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4081599Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4081772Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4082153Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4082368Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4082472Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4082568Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4082663Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4082667Z 2025-12-04T13:44:26.4082903Z [rank2]:[W1204 13:43:40.254580833 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4083073Z [rank3]:[W1204 13:43:40.277999710 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4083248Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4083503Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4083665Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4084035Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4084238Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4084341Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4084437Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4084531Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4084534Z 2025-12-04T13:44:26.4084772Z [rank3]:[W1204 13:43:40.280251911 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4084941Z [rank1]:[W1204 13:43:41.953906586 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4085116Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4085379Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4085542Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4085932Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4086142Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4086248Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4086343Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4086440Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4086441Z 2025-12-04T13:44:26.4086674Z [rank1]:[W1204 13:43:41.956374982 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4086845Z [rank2]:[W1204 13:43:41.254772686 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4087022Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4087278Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4087441Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4087850Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4088056Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4088160Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4088256Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4088353Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4088355Z 2025-12-04T13:44:26.4088586Z [rank2]:[W1204 13:43:41.256790252 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4088758Z [rank3]:[W1204 13:43:41.280411164 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4088931Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4089206Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4089368Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4089746Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4089962Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4090081Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4090178Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4090273Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4090274Z 2025-12-04T13:44:26.4090506Z [rank3]:[W1204 13:43:41.282682634 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4090677Z [rank1]:[W1204 13:43:42.956563964 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4090850Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4091108Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4091272Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4091641Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4091841Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4091945Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4092043Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4092139Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4092141Z 2025-12-04T13:44:26.4092372Z [rank1]:[W1204 13:43:42.958445783 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4092544Z [rank2]:[W1204 13:43:42.256962345 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4092718Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4092973Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4093155Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4093529Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4093751Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4093856Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4093963Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4094060Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4094062Z 2025-12-04T13:44:26.4094293Z [rank2]:[W1204 13:43:42.259363362 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4094463Z [rank3]:[W1204 13:43:42.282874137 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4094640Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4094895Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4095058Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4095425Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4095629Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4095733Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4095829Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4095924Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4095927Z 2025-12-04T13:44:26.4096160Z [rank3]:[W1204 13:43:42.284790415 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4096329Z [rank1]:[W1204 13:43:43.958606776 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4096506Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4096760Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4096934Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4097302Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4097550Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4097668Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4097763Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4097872Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4097875Z 2025-12-04T13:44:26.4098108Z [rank1]:[W1204 13:43:43.960509755 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4098279Z [rank2]:[W1204 13:43:43.259519155 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4098453Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4098709Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4098873Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4099241Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4099443Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4099548Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4099644Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4099742Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4099745Z 2025-12-04T13:44:26.4099979Z [rank2]:[W1204 13:43:43.261666008 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4100152Z [rank3]:[W1204 13:43:43.284924379 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4100324Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4100581Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4100742Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4101120Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4101322Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4101446Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4101542Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4101636Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4101649Z 2025-12-04T13:44:26.4101886Z [rank3]:[W1204 13:43:43.286758719 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4102056Z [rank1]:[W1204 13:43:44.960941102 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4102232Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4102491Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4102653Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4103020Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4103221Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4103324Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4103420Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4103517Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4103519Z 2025-12-04T13:44:26.4103753Z [rank1]:[W1204 13:43:44.963298620 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4103925Z [rank2]:[W1204 13:43:44.261833222 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4104099Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4104356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4104520Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4104901Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4105104Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4105208Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4105325Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4105422Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4105424Z 2025-12-04T13:44:26.4105659Z [rank2]:[W1204 13:43:44.263951605 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4105847Z [rank3]:[W1204 13:43:44.286890473 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4106020Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4106276Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4106441Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4106817Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4107018Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4107121Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4107216Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4107313Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4107315Z 2025-12-04T13:44:26.4107571Z [rank3]:[W1204 13:43:44.288878759 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4107743Z [rank1]:[W1204 13:43:45.963463114 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4107917Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4108171Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4108334Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4108708Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4108927Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4109032Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4109126Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4109235Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4109248Z 2025-12-04T13:44:26.4109483Z [rank1]:[W1204 13:43:45.965616837 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4109666Z [rank2]:[W1204 13:43:45.264111819 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4109843Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4110096Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4110260Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4110629Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4110832Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4110936Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4111033Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4111129Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4111131Z 2025-12-04T13:44:26.4111365Z [rank2]:[W1204 13:43:45.266508686 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4111535Z [rank3]:[W1204 13:43:45.289143110 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4111709Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4111967Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4112128Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4112496Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4112710Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4112815Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4112911Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4113006Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4113008Z 2025-12-04T13:44:26.4113263Z [rank3]:[W1204 13:43:45.291290803 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4113433Z [rank1]:[W1204 13:43:46.965771960 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4113619Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4113875Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4114036Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4114404Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4114604Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4114711Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4114805Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4114902Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4114904Z 2025-12-04T13:44:26.4115139Z [rank1]:[W1204 13:43:46.967495902 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4115310Z [rank2]:[W1204 13:43:46.266688119 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4115486Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4115740Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4115902Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4116269Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4116470Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4116586Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4116682Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4116778Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4116780Z 2025-12-04T13:44:26.4117012Z [rank2]:[W1204 13:43:46.269083127 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4117205Z [rank3]:[W1204 13:43:46.291446387 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4117380Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4117700Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4117861Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4118226Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4118428Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4118532Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4118628Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4118723Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4118725Z 2025-12-04T13:44:26.4118958Z [rank3]:[W1204 13:43:46.293711247 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4119129Z [rank1]:[W1204 13:43:47.967651846 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4119302Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4119557Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4119719Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4120087Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4120288Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4120394Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4120488Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4120598Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4120600Z 2025-12-04T13:44:26.4120835Z [rank1]:[W1204 13:43:47.969083395 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4121018Z [rank2]:[W1204 13:43:47.269265880 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4121212Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4121466Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4121643Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4122013Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4122219Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4122324Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4122419Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4122516Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4122518Z 2025-12-04T13:44:26.4122752Z [rank2]:[W1204 13:43:47.271697867 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4122922Z [rank3]:[W1204 13:43:47.293862791 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4123098Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4123352Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4123516Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4123882Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4124087Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4124191Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4124285Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4124381Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4124393Z 2025-12-04T13:44:26.4124627Z [rank3]:[W1204 13:43:47.295703061 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4124795Z [rank1]:[W1204 13:43:48.969242118 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4124989Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4125244Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4125417Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4125785Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4125987Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4126092Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4126186Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4126284Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4126287Z 2025-12-04T13:44:26.4126521Z [rank1]:[W1204 13:43:48.971399631 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4126690Z [rank2]:[W1204 13:43:48.271869010 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4126863Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4127119Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4127282Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4127687Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4127888Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4127995Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4128090Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4128186Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4128189Z 2025-12-04T13:44:26.4128439Z [rank2]:[W1204 13:43:48.274090581 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4128613Z [rank3]:[W1204 13:43:48.295855475 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4128786Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4129066Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4129229Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4129610Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4129811Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4129916Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4130012Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4130106Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4130109Z 2025-12-04T13:44:26.4130343Z [rank3]:[W1204 13:43:48.297790962 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4130514Z [rank1]:[W1204 13:43:49.971511416 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4130694Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4130950Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4131111Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4131479Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4131679Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4131784Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4131879Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4131975Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4131977Z 2025-12-04T13:44:26.4132209Z [rank1]:[W1204 13:43:49.972971294 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4132389Z [rank2]:[W1204 13:43:49.274273145 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4132564Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4132830Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4133004Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4133371Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4133584Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4133688Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4133784Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4133882Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4133884Z 2025-12-04T13:44:26.4134115Z [rank2]:[W1204 13:43:49.276556595 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4134287Z [rank3]:[W1204 13:43:49.297946516 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4134461Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4134717Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4134881Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4135250Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4135453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4135556Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4135650Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4135746Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4135748Z 2025-12-04T13:44:26.4135981Z [rank3]:[W1204 13:43:49.299836645 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4136160Z [rank1]:[W1204 13:43:50.973122838 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4136334Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4136590Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4136777Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4137147Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4137366Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4137470Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4137614Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4137710Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4137713Z 2025-12-04T13:44:26.4137945Z [rank1]:[W1204 13:43:50.974345941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4138118Z [rank2]:[W1204 13:43:50.276730598 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4138292Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4138546Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4138710Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4139078Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4139282Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4139386Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4139482Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4139579Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4139582Z 2025-12-04T13:44:26.4139815Z [rank2]:[W1204 13:43:50.278730894 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4139984Z [rank3]:[W1204 13:43:50.299994449 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4140173Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4140429Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4140591Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4140982Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4141195Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4141299Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4141394Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4141488Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4141490Z 2025-12-04T13:44:26.4141726Z [rank3]:[W1204 13:43:50.302449185 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4141898Z [rank1]:[W1204 13:43:51.974489795 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4142072Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4142328Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4142490Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4142857Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4143056Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4143163Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4143258Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4143353Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4143355Z 2025-12-04T13:44:26.4143589Z [rank1]:[W1204 13:43:51.975735628 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4143760Z [rank2]:[W1204 13:43:51.278892668 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4143935Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4144201Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4144366Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4144741Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4144952Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4145066Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4145162Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4145259Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4145260Z 2025-12-04T13:44:26.4145492Z [rank2]:[W1204 13:43:51.281260816 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4145664Z [rank3]:[W1204 13:43:51.302581859 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4145837Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4146096Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4146259Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4146624Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4146826Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4146930Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4147027Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4147123Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4147124Z 2025-12-04T13:44:26.4147356Z [rank3]:[W1204 13:43:51.304299162 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4147560Z [rank1]:[W1204 13:43:52.975885762 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4147735Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4148002Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4148166Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4148544Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4148756Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4148861Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4148968Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4149064Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4149066Z 2025-12-04T13:44:26.4149299Z [rank1]:[W1204 13:43:52.977137085 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4149468Z [rank2]:[W1204 13:43:52.281437090 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4149644Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4149897Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4150062Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4150433Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4150637Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4150741Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4150836Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4150933Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4150935Z 2025-12-04T13:44:26.4151168Z [rank2]:[W1204 13:43:52.283905756 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4151338Z [rank3]:[W1204 13:43:52.304429606 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4151513Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4151768Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4151951Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4152319Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4152544Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4152647Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4152742Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4152850Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4152852Z 2025-12-04T13:44:26.4153086Z [rank3]:[W1204 13:43:52.306486671 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4153256Z [rank1]:[W1204 13:43:53.977307059 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4153431Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4153686Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4153848Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4154215Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4154416Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4154521Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4154617Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4158526Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4158529Z 2025-12-04T13:44:26.4158772Z [rank1]:[W1204 13:43:53.979079340 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4158943Z [rank2]:[W1204 13:43:53.284078309 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4159118Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4159378Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4159542Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4159939Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4160141Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4160276Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4160371Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4160468Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4160486Z 2025-12-04T13:44:26.4160722Z [rank2]:[W1204 13:43:53.286383759 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4160894Z [rank3]:[W1204 13:43:53.306612886 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4161067Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4161326Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4161489Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4161859Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4162061Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4162166Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4162262Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4162357Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4162359Z 2025-12-04T13:44:26.4162595Z [rank3]:[W1204 13:43:53.308605322 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4162766Z [rank1]:[W1204 13:43:54.979207084 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4162942Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4163200Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4163363Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4163740Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4163941Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4164046Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4164163Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4164258Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4164260Z 2025-12-04T13:44:26.4164492Z [rank1]:[W1204 13:43:54.980707972 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4164672Z [rank2]:[W1204 13:43:54.286549203 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4164846Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4165104Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4165270Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4165638Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4165839Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4165944Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4166039Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4166137Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4166139Z 2025-12-04T13:44:26.4166371Z [rank2]:[W1204 13:43:54.288492400 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4166544Z [rank3]:[W1204 13:43:54.308712747 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4166717Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4166975Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4167139Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4167569Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4167772Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4167875Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4167970Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4168090Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4168092Z 2025-12-04T13:44:26.4168325Z [rank3]:[W1204 13:43:54.310856510 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4168517Z [rank1]:[W1204 13:43:55.980843866 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4168691Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4168946Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4169109Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4169482Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4169685Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4169789Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4169885Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4169980Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4169982Z 2025-12-04T13:44:26.4170216Z [rank1]:[W1204 13:43:55.982090259 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4170386Z [rank2]:[W1204 13:43:55.288650555 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4170561Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4170815Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4170977Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4171346Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4171559Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4171664Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4171760Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4171856Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4171868Z 2025-12-04T13:44:26.4172110Z [rank2]:[W1204 13:43:55.290464555 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4172281Z [rank3]:[W1204 13:43:55.310971826 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4172465Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4172720Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4172882Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4173250Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4173453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4173556Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4173651Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4173745Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4173747Z 2025-12-04T13:44:26.4173986Z [rank3]:[W1204 13:43:55.313081160 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4174157Z [rank1]:[W1204 13:43:56.982253063 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4174332Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4174587Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4174747Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4175113Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4175315Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4175428Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4175524Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4175619Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4175621Z 2025-12-04T13:44:26.4175862Z [rank1]:[W1204 13:43:56.983500766 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4176043Z [rank2]:[W1204 13:43:56.290937912 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4176217Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4176484Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4176647Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4177021Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4177223Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4177328Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4177423Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4177567Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4177569Z 2025-12-04T13:44:26.4177801Z [rank2]:[W1204 13:43:56.293419778 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4177973Z [rank3]:[W1204 13:43:56.313232394 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4178147Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4178409Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4178571Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4178936Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4179139Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4179242Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4179354Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4179449Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4179451Z 2025-12-04T13:44:26.4179683Z [rank3]:[W1204 13:43:56.315195791 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4179883Z [rank1]:[W1204 13:43:57.983625751 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4180057Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4180313Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4180488Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4180856Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4181058Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4181161Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4181256Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4181351Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4181353Z 2025-12-04T13:44:26.4181586Z [rank1]:[W1204 13:43:57.984855454 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4181755Z [rank2]:[W1204 13:43:57.293570292 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4181929Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4182182Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4182348Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4182718Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4182922Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4183026Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4183120Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4183225Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4183227Z 2025-12-04T13:44:26.4183460Z [rank2]:[W1204 13:43:57.295256815 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4183632Z [rank3]:[W1204 13:43:57.315352565 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4183833Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4184089Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4184263Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4184627Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4184831Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4184934Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4185029Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4185124Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4185127Z 2025-12-04T13:44:26.4185360Z [rank3]:[W1204 13:43:57.317190505 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4185530Z [rank1]:[W1204 13:43:58.984994949 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4185704Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4185957Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4186118Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4186483Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4186685Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4186789Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4186884Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4186980Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4186982Z 2025-12-04T13:44:26.4187225Z [rank1]:[W1204 13:43:58.986928286 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4187395Z [rank2]:[W1204 13:43:58.295430629 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4187605Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4187894Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4188058Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4188440Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4188640Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4188746Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4188840Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4188936Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4188939Z 2025-12-04T13:44:26.4189174Z [rank2]:[W1204 13:43:58.297380087 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4189346Z [rank3]:[W1204 13:43:58.317340739 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4189520Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4189776Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4189938Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4190307Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4190508Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4190612Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4190709Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4190804Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4190807Z 2025-12-04T13:44:26.4191073Z [rank3]:[W1204 13:43:58.319025712 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4191247Z [rank1]:[W1204 13:43:59.987046831 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4191422Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4191692Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4191866Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4192247Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4192449Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4192551Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4192647Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4192742Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4192744Z 2025-12-04T13:44:26.4192977Z [rank1]:[W1204 13:43:59.988984399 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4193148Z [rank2]:[W1204 13:43:59.297529901 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4193322Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4193580Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4193745Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4194113Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4194314Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4194418Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4194512Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4194611Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4194613Z 2025-12-04T13:44:26.4194845Z [rank2]:[W1204 13:43:59.298755014 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4195032Z [rank3]:[W1204 13:43:59.319165047 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4195207Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4195461Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4195644Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4196014Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4196228Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4196331Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4196425Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4196524Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4196526Z 2025-12-04T13:44:26.4196758Z [rank3]:[W1204 13:43:59.320579026 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4196928Z [rank1]:[W1204 13:44:00.989161343 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4197101Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4197356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4197571Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4197940Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4198144Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4198247Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4198341Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4198435Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4198438Z 2025-12-04T13:44:26.4198671Z [rank1]:[W1204 13:44:00.991349515 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4198840Z [rank2]:[W1204 13:44:00.298920429 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4199040Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4199297Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4199460Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4199869Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4200088Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4200193Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4200287Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4200384Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4200387Z 2025-12-04T13:44:26.4200619Z [rank2]:[W1204 13:44:00.301663489 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4200789Z [rank3]:[W1204 13:44:00.320726731 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4200965Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4201218Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4201380Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4201747Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4201949Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4202052Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4202148Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4202244Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4202246Z 2025-12-04T13:44:26.4202481Z [rank3]:[W1204 13:44:00.322709308 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4202652Z [rank1]:[W1204 13:44:01.991499280 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4202825Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4203093Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4203254Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4203631Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4203843Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4203958Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4204054Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4204148Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4204150Z 2025-12-04T13:44:26.4204384Z [rank1]:[W1204 13:44:01.992752832 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4204558Z [rank2]:[W1204 13:44:01.301808443 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4204736Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4204992Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4205155Z 
frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4205525Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4205726Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4205830Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4205925Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4206022Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4206023Z 2025-12-04T13:44:26.4206256Z [rank2]:[W1204 13:44:01.303161874 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4206427Z [rank3]:[W1204 13:44:01.322849053 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4206602Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4206869Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4207031Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4207408Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4207645Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4207768Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4207864Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4207959Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4207961Z 2025-12-04T13:44:26.4208193Z [rank3]:[W1204 13:44:01.324880078 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4208364Z [rank1]:[W1204 13:44:02.992934456 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4208537Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4208793Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4208957Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4209327Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4209529Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4209632Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4209728Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4209823Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4209825Z 2025-12-04T13:44:26.4210058Z [rank1]:[W1204 13:44:02.994875974 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4210229Z [rank2]:[W1204 13:44:02.303334578 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4210405Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4210661Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4210837Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4211208Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4211440Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4211544Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4211648Z frame #5: + 0x94ac3 (0x78d53d26dac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4211746Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4211747Z 2025-12-04T13:44:26.4211979Z [rank2]:[W1204 13:44:02.305678147 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4212149Z [rank3]:[W1204 13:44:02.325052512 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4212325Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4212579Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4212743Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4213107Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4213311Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4213415Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4213509Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4213605Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4213607Z 2025-12-04T13:44:26.4213839Z [rank3]:[W1204 13:44:02.326832243 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4214009Z [rank1]:[W1204 13:44:03.995008579 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4214184Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4214438Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4214600Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4214981Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4215185Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4215317Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4215414Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4215510Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4215522Z 2025-12-04T13:44:26.4215757Z [rank1]:[W1204 13:44:03.996878008 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4215926Z [rank2]:[W1204 13:44:03.305834491 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4216100Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4216356Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4216517Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4216887Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4217087Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4217193Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4217287Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4217383Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4217386Z 2025-12-04T13:44:26.4217655Z [rank2]:[W1204 13:44:03.307634032 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4217827Z [rank3]:[W1204 13:44:03.326991438 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4218003Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4218258Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4218421Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4218804Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4219006Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4219127Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4219233Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4219329Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4219331Z 2025-12-04T13:44:26.4219576Z [rank3]:[W1204 13:44:03.329278798 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4219748Z [rank1]:[W1204 13:44:04.997029703 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4219923Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4220178Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4220340Z 
frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4220707Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4220909Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4221012Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4221111Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4221206Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4221209Z 2025-12-04T13:44:26.4221441Z [rank1]:[W1204 13:44:04.999412981 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4221613Z [rank2]:[W1204 13:44:04.307802107 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4221788Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4222045Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4222208Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4222584Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4222786Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4222891Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4222985Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4223102Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4223105Z 2025-12-04T13:44:26.4223337Z [rank2]:[W1204 13:44:04.310103816 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4223520Z [rank3]:[W1204 13:44:04.329429953 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4223694Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4223949Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4224113Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4224481Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4224684Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4224789Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4224884Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4224982Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4224984Z 2025-12-04T13:44:26.4225216Z [rank3]:[W1204 13:44:04.331616165 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4225262Z PASSED [47.2279s] [ 38%] 2025-12-04T13:44:26.4225409Z distributed/test_dynamo_distributed.py::TestSingleProc::test_async_subclass_no_specialize PASSED [0.0351s] [ 41%] 2025-12-04T13:44:26.4225717Z distributed/test_dynamo_distributed.py::TestSingleProc::test_compiled_flex_attention_full_model_ddp [rank1]:[W1204 13:44:05.999518877 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4225890Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4226148Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4226313Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4226696Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4226898Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4227024Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4227121Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4227216Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4227218Z 2025-12-04T13:44:26.4227469Z [rank1]:[W1204 13:44:05.000780939 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4227682Z [rank2]:[W1204 13:44:05.310260121 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4227855Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4228110Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4228272Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4228644Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4228849Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4228953Z frame #4: + 0xdc253 (0x78d42d848253 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4229050Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4229145Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4229147Z 2025-12-04T13:44:26.4229379Z [rank2]:[W1204 13:44:05.312708147 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4229552Z [rank3]:[W1204 13:44:05.331760320 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4229727Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4229984Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4230148Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4230527Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4230729Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4230833Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4230942Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4231050Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 
2025-12-04T13:44:26.4231052Z 2025-12-04T13:44:26.4231284Z [rank3]:[W1204 13:44:05.333975181 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4231473Z [rank1]:[W1204 13:44:06.000874805 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4231647Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4231903Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4232067Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4232432Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4232634Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4232737Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4232832Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4232930Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4232931Z 2025-12-04T13:44:26.4233165Z [rank1]:[W1204 13:44:06.002041620 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with 
error: Broken pipe 2025-12-04T13:44:26.4233337Z [rank2]:[W1204 13:44:06.312870642 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4233511Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4233764Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4233928Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4234296Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4234508Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4234614Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4234708Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4234804Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4234818Z 2025-12-04T13:44:26.4235064Z [rank2]:[W1204 13:44:06.315262490 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4235235Z [rank3]:[W1204 13:44:06.334121597 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4235423Z Exception raised 
from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4235677Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4235840Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4236207Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4236409Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4236513Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4236607Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4236703Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4236706Z 2025-12-04T13:44:26.4236938Z [rank3]:[W1204 13:44:06.336352598 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4237110Z [rank1]:[W1204 13:44:07.002197945 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4237285Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4237589Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac 
(0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4237753Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4238121Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4238336Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4238439Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4238534Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4239786Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4239788Z 2025-12-04T13:44:26.4240059Z [rank1]:[W1204 13:44:07.003511576 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4240101Z PASSED [2.1229s] [ 44%] 2025-12-04T13:44:26.4240405Z distributed/test_dynamo_distributed.py::TestSingleProc::test_compiled_flex_attention_local_ddp [rank2]:[W1204 13:44:07.315433844 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4240581Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4240839Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 
2025-12-04T13:44:26.4241015Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4241385Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4241590Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4241693Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4241789Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4241886Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4241888Z 2025-12-04T13:44:26.4242123Z [rank2]:[W1204 13:44:07.317492309 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4242293Z [rank3]:[W1204 13:44:07.336513413 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4242468Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4242723Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4242886Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4243258Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, 
std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4243471Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4243576Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4243671Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4243826Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4243828Z 2025-12-04T13:44:26.4244090Z [rank3]:[W1204 13:44:07.338747414 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4244132Z PASSED [0.3383s] [ 47%] 2025-12-04T13:44:26.4244256Z distributed/test_dynamo_distributed.py::TestSingleProc::test_custom_layer PASSED [0.1066s] [ 50%] 2025-12-04T13:44:26.4244390Z distributed/test_dynamo_distributed.py::TestSingleProc::test_ddp_baseline_aot_eager PASSED [0.2240s] [ 52%] 2025-12-04T13:44:26.4244676Z distributed/test_dynamo_distributed.py::TestSingleProc::test_ddp_baseline_inductor [rank1]:[W1204 13:44:08.003665271 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4244850Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4245106Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4245269Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4245639Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4245843Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4245946Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4246044Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4246139Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4246141Z 2025-12-04T13:44:26.4246375Z [rank1]:[W1204 13:44:08.005083080 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4246545Z [rank2]:[W1204 13:44:08.317652424 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4246720Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4246975Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4247140Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4247587Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4247790Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4247916Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4248024Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4248132Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4248134Z 2025-12-04T13:44:26.4248367Z [rank2]:[W1204 13:44:08.319876905 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4248539Z [rank3]:[W1204 13:44:08.338895139 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4248713Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4248971Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4249135Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4249502Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4249705Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4249809Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4249906Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4250001Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4250006Z 2025-12-04T13:44:26.4250237Z [rank3]:[W1204 13:44:08.341184619 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4250278Z PASSED [0.6619s] [ 55%] 2025-12-04T13:44:26.4250413Z distributed/test_dynamo_distributed.py::TestSingleProc::test_empty_graph_inductor PASSED [0.0062s] [ 58%] 2025-12-04T13:44:26.4250553Z distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_dup_tensors_diff_source PASSED [0.1748s] [ 61%] 2025-12-04T13:44:26.4250689Z distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_dup_tensors_same_source PASSED [0.1329s] [ 63%] 2025-12-04T13:44:26.4250823Z distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_orig_params_assert PASSED [0.1007s] [ 66%] 2025-12-04T13:44:26.4251099Z distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_skip_guards [rank1]:[W1204 13:44:09.005259915 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4251275Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4251538Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4251703Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4252096Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4252306Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4252411Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4252506Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4252604Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4252607Z 2025-12-04T13:44:26.4252838Z [rank1]:[W1204 13:44:09.007212892 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4253014Z [rank2]:[W1204 13:44:09.320041210 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4253190Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4253446Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4253610Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4253975Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4254181Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4254284Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4254380Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4254477Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4254480Z 2025-12-04T13:44:26.4254712Z [rank2]:[W1204 13:44:09.322279251 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4254883Z [rank3]:[W1204 13:44:09.341325214 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4255058Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4255313Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4255488Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4255856Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4256094Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4256198Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4256294Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4256391Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4256392Z 2025-12-04T13:44:26.4256625Z [rank3]:[W1204 13:44:09.343557935 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4256665Z PASSED [0.8429s] [ 69%] 2025-12-04T13:44:26.4256816Z distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_skip_register_attr_or_module PASSED [0.2598s] [ 72%] 2025-12-04T13:44:26.4257097Z distributed/test_dynamo_distributed.py::TestSingleProc::test_fsdp_staticmethod [rank1]:[W1204 13:44:10.007394816 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4257271Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4257567Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4257729Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4258098Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4258301Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 
0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4258408Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4258504Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4258600Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4258602Z 2025-12-04T13:44:26.4258837Z [rank1]:[W1204 13:44:10.009560149 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4258877Z PASSED [0.2647s] [ 75%] 2025-12-04T13:44:26.4259148Z distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split [rank2]:[W1204 13:44:10.322451026 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4259323Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4259593Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4259755Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4260160Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4260375Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 
2025-12-04T13:44:26.4260480Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4260576Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4260672Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4260675Z 2025-12-04T13:44:26.4260913Z [rank2]:[W1204 13:44:10.324738326 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4261083Z [rank3]:[W1204 13:44:10.343693461 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4261258Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4261513Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4261675Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4262043Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4262244Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4262348Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4262442Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4262539Z frame #6: + 0x1268c0 
(0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4262541Z 2025-12-04T13:44:26.4262774Z [rank3]:[W1204 13:44:10.345892113 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4262815Z PASSED [0.2829s] [ 77%] 2025-12-04T13:44:26.4263105Z distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_ctx_manager [rank1]:[W1204 13:44:11.009737934 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4263279Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4263542Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4263705Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4264094Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4264304Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4264410Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4264504Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4264600Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4264603Z 2025-12-04T13:44:26.4264836Z 
[rank1]:[W1204 13:44:11.011571923 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4265008Z [rank2]:[W1204 13:44:11.324920841 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4265182Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4265439Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4265601Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4265971Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4266174Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4266278Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4266375Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4266473Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4266475Z 2025-12-04T13:44:26.4266709Z [rank2]:[W1204 13:44:11.327113183 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4266880Z [rank3]:[W1204 
13:44:11.346061108 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4267054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4267324Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4267527Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4267930Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4268143Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4268246Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4268342Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4268438Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4268439Z 2025-12-04T13:44:26.4268672Z [rank3]:[W1204 13:44:11.348223820 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4268844Z [rank1]:[W1204 13:44:12.011732399 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4269017Z Exception raised from sendBytes at 
/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4269274Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4269436Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4269808Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4270011Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4270116Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4270210Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4270308Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4270310Z 2025-12-04T13:44:26.4270543Z [rank1]:[W1204 13:44:12.013664136 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4270714Z [rank2]:[W1204 13:44:12.327292117 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4270890Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4271143Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4271318Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4271687Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4271923Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4272027Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4272123Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4272221Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4272223Z 2025-12-04T13:44:26.4272455Z [rank2]:[W1204 13:44:12.329684295 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4272627Z [rank3]:[W1204 13:44:12.348374446 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4272801Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4273057Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4273223Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4273592Z 
frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4273796Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4273900Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4273996Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4274091Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4274093Z 2025-12-04T13:44:26.4274328Z [rank3]:[W1204 13:44:12.350577567 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4274369Z PASSED [1.8749s] [ 80%] 2025-12-04T13:44:26.4274656Z distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_inductor [rank1]:[W1204 13:44:13.013836461 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4274831Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4275087Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4275270Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4275637Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 
(0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4275869Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4275974Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4276068Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4276165Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4276167Z 2025-12-04T13:44:26.4276400Z [rank1]:[W1204 13:44:13.015691941 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4276440Z PASSED [0.6806s] [ 83%] 2025-12-04T13:44:26.4276769Z distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_inductor_layout_optimizations_inference [rank2]:[W1204 13:44:13.329860480 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4276945Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4277202Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4277365Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4277788Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4277992Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4278095Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4278190Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4278286Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4278288Z 2025-12-04T13:44:26.4278525Z [rank2]:[W1204 13:44:13.332058552 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4278700Z [rank3]:[W1204 13:44:13.350724823 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4278876Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4279131Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4279310Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4279677Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4279920Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4280024Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4280120Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4280215Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4280218Z 2025-12-04T13:44:26.4280449Z [rank3]:[W1204 13:44:13.352834467 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4280621Z [rank1]:[W1204 13:44:14.015779827 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4280798Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4281054Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4281216Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4281582Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4281787Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4281892Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4281987Z frame #5: + 0x94ac3 (0x7e114807cac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4282082Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4282085Z 2025-12-04T13:44:26.4282317Z [rank1]:[W1204 13:44:14.016944072 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4282493Z [rank2]:[W1204 13:44:14.332201968 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4282669Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4282927Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4283090Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4283467Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4283689Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4283805Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4283900Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4283995Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4283997Z 2025-12-04T13:44:26.4284230Z [rank2]:[W1204 13:44:14.333390961 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4284403Z [rank3]:[W1204 13:44:14.353019712 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4284580Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4284836Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4285000Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4285373Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4285575Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4285682Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4285776Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4285872Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4285874Z 2025-12-04T13:44:26.4286107Z [rank3]:[W1204 13:44:14.355066297 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4286278Z [rank1]:[W1204 13:44:15.017085648 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4286453Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4286710Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4286871Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4287250Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4287453Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4287639Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4287736Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4287830Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4287833Z 2025-12-04T13:44:26.4288067Z [rank1]:[W1204 13:44:15.018982156 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4288239Z [rank2]:[W1204 13:44:15.333548077 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4288412Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4288667Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4288831Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4289200Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4289405Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4289510Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4289608Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4289702Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4289704Z 2025-12-04T13:44:26.4289937Z [rank2]:[W1204 13:44:15.335917815 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4290107Z [rank3]:[W1204 13:44:15.355237082 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4290282Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4290538Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4290700Z 
frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4291096Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4291297Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4291422Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4291535Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4291643Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4291645Z 2025-12-04T13:44:26.4291878Z [rank3]:[W1204 13:44:15.357173779 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4292050Z [rank1]:[W1204 13:44:16.019153782 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4292226Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4292481Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4292645Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4293012Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > 
const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4293213Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4293317Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4293416Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4293512Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4293515Z 2025-12-04T13:44:26.4293748Z [rank1]:[W1204 13:44:16.021094929 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4293921Z [rank2]:[W1204 13:44:16.336112911 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4294096Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4294349Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4294514Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4294881Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4295093Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4295196Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4295305Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4295412Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4295425Z 2025-12-04T13:44:26.4295659Z [rank2]:[W1204 13:44:16.338397551 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4295827Z [rank3]:[W1204 13:44:16.357336576 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4296004Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4296261Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4296424Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4296793Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4296994Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4297098Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4297193Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4297289Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4297292Z 2025-12-04T13:44:26.4297560Z [rank3]:[W1204 13:44:16.359675425 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4297731Z [rank1]:[W1204 13:44:17.021263535 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4297906Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4298161Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4298326Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4298695Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4298913Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4299017Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4299112Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4299225Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4299228Z 2025-12-04T13:44:26.4299489Z [rank1]:[W1204 13:44:17.023131434 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4299661Z [rank2]:[W1204 13:44:17.338583046 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4299834Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4300090Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4300251Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4300624Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4300826Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4300930Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4301025Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4301124Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4301127Z 2025-12-04T13:44:26.4301361Z [rank2]:[W1204 13:44:17.340896665 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4301531Z [rank3]:[W1204 13:44:17.359835310 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4301705Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4301960Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4302123Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4302491Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4302696Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4302810Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4302905Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4303000Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4303002Z 2025-12-04T13:44:26.4303245Z [rank3]:[W1204 13:44:17.362201788 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4303311Z PASSED [4.8501s] [ 86%] 2025-12-04T13:44:26.4303640Z distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_inductor_layout_optimizations_training [rank1]:[W1204 13:44:18.023247631 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.4303816Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4304071Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4304232Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4304601Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4304801Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4304907Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4305002Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4305096Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4305099Z 2025-12-04T13:44:26.4305332Z [rank1]:[W1204 13:44:18.025072031 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4305502Z [rank2]:[W1204 13:44:18.341072020 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4305677Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4305933Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4306095Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4306464Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4306667Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4306782Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4306877Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4306974Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4306992Z 2025-12-04T13:44:26.4307233Z [rank2]:[W1204 13:44:18.343334501 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4307415Z [rank3]:[W1204 13:44:18.362345334 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4307644Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4307901Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4308064Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4308434Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4308636Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4308739Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4308835Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4308930Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4308931Z 2025-12-04T13:44:26.4309165Z [rank3]:[W1204 13:44:18.364629034 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4309341Z [rank1]:[W1204 13:44:19.025234557 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4309515Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4309770Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4309930Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4310296Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4310497Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4310601Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4310710Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4310806Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4310807Z 2025-12-04T13:44:26.4311039Z [rank1]:[W1204 13:44:19.027325301 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4311249Z [rank2]:[W1204 13:44:19.343500806 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4311424Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4311680Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4311845Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4312216Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4312420Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4312524Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4312619Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4312716Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4312718Z 2025-12-04T13:44:26.4312948Z [rank2]:[W1204 13:44:19.345489713 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4313119Z [rank3]:[W1204 13:44:19.364799150 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4313294Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4313552Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4313716Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4314082Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4314286Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4314389Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4314484Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4314579Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4314592Z 2025-12-04T13:44:26.4314826Z [rank3]:[W1204 13:44:19.367110469 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4314879Z PASSED [1.8564s] [ 88%] 2025-12-04T13:44:26.4315189Z distributed/test_dynamo_distributed.py::TestSingleProc::test_graph_split_inductor_transpose [rank1]:[W1204 13:44:20.027471017 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4315376Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4315630Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4315794Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4316163Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4316368Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4316472Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4316567Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4316664Z frame #6: + 0x1268c0 (0x7e114810e8c0 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4316666Z 2025-12-04T13:44:26.4316898Z [rank1]:[W1204 13:44:20.028726119 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4317070Z [rank2]:[W1204 13:44:20.345677028 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4317244Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4317545Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4317711Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4318080Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4318284Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4318386Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4318482Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4318577Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4318592Z 2025-12-04T13:44:26.4318825Z [rank2]:[W1204 13:44:20.348095975 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server 
has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4319013Z [rank3]:[W1204 13:44:20.367263405 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4319215Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4319471Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4319633Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4319999Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4320202Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4320309Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4320403Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4320499Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4320501Z 2025-12-04T13:44:26.4320734Z [rank3]:[W1204 13:44:20.369040416 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4320903Z [rank1]:[W1204 13:44:21.028884895 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 
2025-12-04T13:44:26.4321079Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4321333Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4321495Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4321861Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4322065Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4322171Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4322267Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4322363Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4322365Z 2025-12-04T13:44:26.4322607Z [rank1]:[W1204 13:44:21.030905851 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4322778Z [rank2]:[W1204 13:44:21.348288130 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4322966Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4323245Z frame #0: c10::Error::Error(c10::SourceLocation, 
std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4323408Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4323776Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4323981Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4324086Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4324182Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4324278Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4324280Z 2025-12-04T13:44:26.4324518Z [rank2]:[W1204 13:44:21.350651138 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4324689Z [rank3]:[W1204 13:44:21.369200412 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4324865Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4325123Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4325284Z frame #1: + 0x6eb6c0e (0x719d56155c0e in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4325652Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4325852Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4325958Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4326054Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4326150Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4326152Z 2025-12-04T13:44:26.4326384Z [rank3]:[W1204 13:44:21.371318775 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4326435Z PASSED [2.2163s] [ 91%] 2025-12-04T13:44:26.4326715Z distributed/test_dynamo_distributed.py::TestSingleProc::test_higher_order_op [rank1]:[W1204 13:44:22.031089366 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4326900Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4327174Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4327338Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4327741Z frame #2: 
c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4327943Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4328049Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4328144Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4328238Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4328240Z 2025-12-04T13:44:26.4328475Z [rank1]:[W1204 13:44:22.032703161 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4328516Z PASSED [0.1625s] [ 94%] 2025-12-04T13:44:26.4328648Z distributed/test_dynamo_distributed.py::TestSingleProc::test_ignored_parameters PASSED [0.0954s] [ 97%] 2025-12-04T13:44:26.4328915Z distributed/test_dynamo_distributed.py::TestSingleProc::test_no_split [rank2]:[W1204 13:44:22.350841703 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4329094Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4329350Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4329514Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4329881Z frame #2: 
c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4330085Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4330192Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4330286Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4330383Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4330403Z 2025-12-04T13:44:26.4330636Z [rank2]:[W1204 13:44:22.353077894 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4330806Z [rank3]:[W1204 13:44:22.371454502 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4331020Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4331278Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4331442Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4331807Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4332011Z frame #3: 
c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4332115Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4332209Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4332304Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4332306Z 2025-12-04T13:44:26.4332538Z [rank3]:[W1204 13:44:22.373535246 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4332578Z PASSED [0.0887s] [100%] 2025-12-04T13:44:26.4332580Z 2025-12-04T13:44:26.4332832Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-fe62af45fb6188fc.xml - 2025-12-04T13:44:26.4332915Z ========== 30 passed, 6 skipped, 26 deselected in 1173.09s (0:19:33) =========== 2025-12-04T13:44:26.4333087Z [rank1]:[W1204 13:44:23.032853837 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4333260Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4333520Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4333682Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4334053Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 
(0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4334257Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4334362Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4334472Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4334569Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4334571Z 2025-12-04T13:44:26.4334815Z [rank1]:[W1204 13:44:23.034680717 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4335007Z [rank2]:[W1204 13:44:23.353258690 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4335182Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4335437Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4335600Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4335971Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4336175Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4336279Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4336375Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4336472Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4336476Z 2025-12-04T13:44:26.4336709Z [rank2]:[W1204 13:44:23.355592109 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4336882Z [rank3]:[W1204 13:44:23.373666403 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4337054Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4337311Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4337522Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4337892Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4338094Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4338197Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4338292Z frame #5: + 0x94ac3 (0x719e15e84ac3 in 
/lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4338405Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4338407Z 2025-12-04T13:44:26.4338640Z [rank3]:[W1204 13:44:23.375941523 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4338836Z [rank1]:[W1204 13:44:24.034845432 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4339026Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4339281Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4339443Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4339811Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4340014Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4340119Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4340213Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4340310Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4340312Z 2025-12-04T13:44:26.4340546Z [rank1]:[W1204 13:44:24.036916527 ProcessGroupNCCL.cpp:1802] 
[PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4340716Z [rank2]:[W1204 13:44:24.355748085 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33744, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4340891Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4341145Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x78d52a585b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4341308Z frame #1: + 0x6eb6c0e (0x78d47cd55c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4341675Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x78d47cd520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4341878Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x78d46cd2eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4341982Z frame #4: + 0xdc253 (0x78d42d848253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4342078Z frame #5: + 0x94ac3 (0x78d53d26dac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4342176Z frame #6: + 0x1268c0 (0x78d53d2ff8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4342178Z 2025-12-04T13:44:26.4342421Z [rank2]:[W1204 13:44:24.358146252 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4342591Z [rank3]:[W1204 13:44:24.376077559 TCPStore.cpp:106] [c10d] sendBytes failed 
on SocketImpl(fd=16, addr=[localhost]:33746, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4342795Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 2025-12-04T13:44:26.4343051Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x719dfdd85b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4343212Z frame #1: + 0x6eb6c0e (0x719d56155c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4343579Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x719d561520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4343783Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x719d4612eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4343887Z frame #4: + 0xdc253 (0x719d06c48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4343982Z frame #5: + 0x94ac3 (0x719e15e84ac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4344077Z frame #6: + 0x1268c0 (0x719e15f168c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4344079Z 2025-12-04T13:44:26.4344314Z [rank3]:[W1204 13:44:24.378292451 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4344485Z [rank1]:[W1204 13:44:25.037032614 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=14, addr=[localhost]:33750, remote=[localhost]:6789): Broken pipe 2025-12-04T13:44:26.4344661Z Exception raised from sendBytes at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first): 
2025-12-04T13:44:26.4344919Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) + 0xac (0x7e1132985b7c in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so) 2025-12-04T13:44:26.4345080Z frame #1: + 0x6eb6c0e (0x7e1088355c0e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4345447Z frame #2: c10d::TCPStore::check(std::vector, std::allocator >, std::allocator, std::allocator > > > const&) + 0x1b8 (0x7e10883520e8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 2025-12-04T13:44:26.4345649Z frame #3: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c5 (0x7e107832eee5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_hip.so) 2025-12-04T13:44:26.4345755Z frame #4: + 0xdc253 (0x7e1038e48253 in /lib/x86_64-linux-gnu/libstdc++.so.6) 2025-12-04T13:44:26.4345850Z frame #5: + 0x94ac3 (0x7e114807cac3 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4345945Z frame #6: + 0x1268c0 (0x7e114810e8c0 in /lib/x86_64-linux-gnu/libc.so.6) 2025-12-04T13:44:26.4345947Z 2025-12-04T13:44:26.4346191Z [rank1]:[W1204 13:44:25.038661278 ProcessGroupNCCL.cpp:1802] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe 2025-12-04T13:44:26.4346382Z The following tests failed and then succeeded when run in a new process['test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor'] 2025-12-04T13:44:26.4346555Z The following tests failed consistently: ['test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager'] 2025-12-04T13:44:26.4346567Z 2025-12-04T13:44:26.4346781Z FINISHED PRINTING LOG FILE of distributed/test_dynamo_distributed 1/1 (test/test-reports/distributed.test_dynamo_distributed_1.1_47fb19e1d47c844a_.log) 2025-12-04T13:44:26.4346783Z 
2025-12-04T13:44:26.4346917Z Finished distributed/test_dynamo_distributed 1/1 ... [2025-12-04 13:44:25.449232][2261620.385465173], took 56.40min 2025-12-04T13:44:26.4347180Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T13:44:27.6001429Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T13:44:27.6002099Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T13:44:27.6002606Z Uploading artifacts took 0.00 seconds 2025-12-04T13:44:27.6003021Z distributed/test_dynamo_distributed 1/1 failed! 2025-12-04T13:44:27.6003634Z Running distributed/tensor/test_op_schema 1/1 ... [2025-12-04 13:44:27.600024][2261622.5362707] 2025-12-04T13:44:27.6004194Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T13:44:27.6005316Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/test_op_schema.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 13:44:27.600299] 2025-12-04T13:44:29.7682221Z 2025-12-04T13:44:29.7683750Z distributed/tensor/test_op_schema 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.tensor.test_op_schema_1.1_fee28d4139592ccb_.log 2025-12-04T13:44:29.7685441Z Running 2 items in this shard: test/distributed/tensor/test_op_schema.py::TestOpSchema::test_equality_checks_lists_of_dtensor_spec, test/distributed/tensor/test_op_schema.py::TestOpSchema::test_equality_respects_static_attributes 2025-12-04T13:44:29.7686410Z 2025-12-04T13:44:29.7686768Z Finished distributed/tensor/test_op_schema 1/1 ... 
[2025-12-04 13:44:29.768081][2261624.704327719], took 0.04min 2025-12-04T13:44:29.7688189Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T13:44:29.7701584Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T13:44:29.7702435Z Running distributed/checkpoint/test_nested_dict 1/1 ... [2025-12-04 13:44:29.770108][2261624.706358114] 2025-12-04T13:44:29.7702910Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T13:44:29.7704828Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/test_nested_dict.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 13:44:29.770318] 2025-12-04T13:44:32.1380644Z 2025-12-04T13:44:32.1381579Z distributed/checkpoint/test_nested_dict 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint.test_nested_dict_1.1_d73d948037fd8216_.log 2025-12-04T13:44:32.1382626Z Running 2 items in this shard: test/distributed/checkpoint/test_nested_dict.py::TestFlattening::test_flattening_round_trip, test/distributed/checkpoint/test_nested_dict.py::TestFlattening::test_mapping 2025-12-04T13:44:32.1383155Z 2025-12-04T13:44:32.1383409Z Finished distributed/checkpoint/test_nested_dict 1/1 ... [2025-12-04 13:44:32.137783][2261627.074032022], took 0.04min 2025-12-04T13:44:32.1384741Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T13:44:32.1392701Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T13:44:32.1394523Z Running distributed/checkpoint/test_consolidate_hf_safetensors 1/1 ... 
[2025-12-04 13:44:32.139363][2261627.075614507] 2025-12-04T13:44:32.1394936Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T13:44:32.1396653Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/test_consolidate_hf_safetensors.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 13:44:32.139554] 2025-12-04T13:57:01.1396443Z 2025-12-04T13:57:01.1397899Z distributed/checkpoint/test_consolidate_hf_safetensors 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint.test_consolidate_hf_safetensors_1.1_876d4e0e7322f100_.log 2025-12-04T13:57:01.1402695Z Running 7 items in this shard: test/distributed/checkpoint/test_consolidate_hf_safetensors.py::TestConsolidateHFSafeTensors::test_calculate_max_contiguous_elements_valid_cases, test/distributed/checkpoint/test_consolidate_hf_safetensors.py::TestConsolidateHFSafeTensors::test_calculate_max_contiguous_elements_validations, test/distributed/checkpoint/test_consolidate_hf_safetensors.py::TestConsolidateHFSafeTensors::test_consolidate_one_file_with_two_ranks, test/distributed/checkpoint/test_consolidate_hf_safetensors.py::TestConsolidateHFSafeTensors::test_consolidate_to_one_file, test/distributed/checkpoint/test_consolidate_hf_safetensors.py::TestConsolidateHFSafeTensors::test_consolidate_to_two_files, test/distributed/checkpoint/test_consolidate_hf_safetensors.py::TestConsolidateHFSafeTensors::test_consolidate_with_two_ranks, test/distributed/checkpoint/test_consolidate_hf_safetensors.py::TestConsolidateHFSafeTensors::test_write_sub_tensor_to_file_optimized 2025-12-04T13:57:01.1406144Z 2025-12-04T13:57:01.1406422Z Finished distributed/checkpoint/test_consolidate_hf_safetensors 1/1 ... 
[2025-12-04 13:57:01.139319][2262376.075562773], took 12.48min 2025-12-04T13:57:01.1407147Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T13:57:01.1415867Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T13:57:01.1419015Z Running distributed/tensor/test_dtensor_compile 3/4 ... [2025-12-04 13:57:01.141808][2262376.078058777] 2025-12-04T13:57:01.1419290Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T13:57:01.1421246Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/test_dtensor_compile.py', '--shard-id=3', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 13:57:01.142026] 2025-12-04T14:34:49.2574773Z 2025-12-04T14:34:49.2575820Z PRINTING LOG FILE of distributed/tensor/test_dtensor_compile 3/4 (test/test-reports/distributed.tensor.test_dtensor_compile_3.4_a5d422218d59addd_.log) 2025-12-04T14:34:49.2576873Z Test results will be stored in test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-20c3b8b1d58be378.xml 2025-12-04T14:34:49.2577704Z ============================= test session starts ============================== 2025-12-04T14:34:49.2578187Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T14:34:49.2578602Z cachedir: .pytest_cache 2025-12-04T14:34:49.2579090Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T14:34:49.2579603Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T14:34:49.2580349Z configfile: pytest.ini 2025-12-04T14:34:49.2580829Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, 
xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T14:34:49.2581330Z collecting ... collected 49 items 2025-12-04T14:34:49.2581661Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T14:34:49.2585789Z Running 14 items in this shard: test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_basic_export, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_dynamic_loss_parallel_log_softmax, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_dynamic_recompiles, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_dynamic_slice, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_noncontiguous_output, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_partial_placement_redistribute_unbalanced_correct_strides, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dynamo_dtensor_from_local_dynamic_shapes, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dynamo_to_local_grad_placements_sequence, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_get_local_rank_compile, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_tp_compile_comm_reordering, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_compile_embedding_redistribute, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_False, test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True 2025-12-04T14:34:49.2590012Z 2025-12-04T14:34:49.2590524Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_basic_export PASSED [0.2658s] [ 7%] 
2025-12-04T14:34:49.2591132Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_dynamic_loss_parallel_log_softmax PASSED [0.3849s] [ 14%] 2025-12-04T14:34:49.2591725Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_dynamic_recompiles PASSED [0.3069s] [ 21%] 2025-12-04T14:34:49.2592319Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_dynamic_slice SKIPPED [0.0002s] (DTensor + dynamic fails - s77 + 8 is not tracked with proxy .. proxy_tensor.PythonKeyTracer) [ 28%] 2025-12-04T14:34:49.2592907Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_noncontiguous_output PASSED [121.8306s] [ 35%] 2025-12-04T14:34:49.2593425Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dtensor_partial_placement_redistribute_unbalanced_correct_strides PASSED [0.0470s] [ 42%] 2025-12-04T14:34:49.2593964Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dynamo_dtensor_from_local_dynamic_shapes PASSED [0.3289s] [ 50%] 2025-12-04T14:34:49.2594443Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_dynamo_to_local_grad_placements_sequence PASSED [0.0406s] [ 57%] 2025-12-04T14:34:49.2594903Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_get_local_rank_compile PASSED [0.0872s] [ 64%] 2025-12-04T14:34:49.2595336Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompile::test_tp_compile_comm_reordering PASSED [1.2896s] [ 71%] 2025-12-04T14:34:49.2595976Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True I1204 13:59:10.360000 107915 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 108641 2025-12-04T14:34:49.2596608Z I1204 13:59:10.361000 107915 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 108642 2025-12-04T14:34:49.2597096Z I1204 13:59:10.361000 107915 
site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 108643 2025-12-04T14:34:49.2597582Z I1204 13:59:10.362000 107915 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 108644 2025-12-04T14:34:49.2598136Z [rank2]:W1204 14:02:58.112000 108643 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2598765Z [rank0]:W1204 14:02:58.112000 108641 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2599343Z [rank1]:W1204 14:02:58.445000 108642 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2599887Z [rank0]:I1204 14:04:10.366000 108641 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 0 2025-12-04T14:34:49.2600404Z [rank1]:I1204 14:04:10.366000 108642 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 1 2025-12-04T14:34:49.2600914Z [rank2]:I1204 14:04:10.366000 108643 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 2 2025-12-04T14:34:49.2601399Z [rank0]:I1204 14:04:10.367000 108641 site-packages/torch/testing/_internal/common_distributed.py:891] Process 0 sent traceback 2025-12-04T14:34:49.2601821Z [rank1]:I1204 14:04:10.367000 108642 site-packages/torch/testing/_internal/common_distributed.py:891] Process 1 sent traceback 2025-12-04T14:34:49.2602267Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Process 0 timed out with traceback: 2025-12-04T14:34:49.2602816Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2603156Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000073d3db7fd640 
(most recent call first): 2025-12-04T14:34:49.2603511Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2603803Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2604135Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000073d3dcbff640 (most recent call first): 2025-12-04T14:34:49.2604489Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2604777Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2605105Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000073d3dadfc640 (most recent call first): 2025-12-04T14:34:49.2605457Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2605742Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2606072Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000073d3dc1fe640 (most recent call first): 2025-12-04T14:34:49.2606428Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2606717Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2607049Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000073d47c9ff640 (most recent call first): 2025-12-04T14:34:49.2607552Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2608020Z E1204 14:04:10.367000 107915 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2608522Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2609056Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2609546Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2609925Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2610268Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x000073d5ca782640 (most recent call first): 2025-12-04T14:34:49.2610802Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2611350Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2611818Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2612293Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2612658Z E1204 
14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2612981Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000073f7cb152740 (most recent call first): 2025-12-04T14:34:49.2613483Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1087 in _compare_regular_values_close 2025-12-04T14:34:49.2614080Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 905 in _compare_values 2025-12-04T14:34:49.2614642Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 747 in compare 2025-12-04T14:34:49.2615216Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1298 in not_close_error_metas 2025-12-04T14:34:49.2615809Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4245 in assertEqual 2025-12-04T14:34:49.2616408Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1397 in test_2d_fsdp_tp_ac_compile 2025-12-04T14:34:49.2617020Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 
2025-12-04T14:34:49.2617664Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.2618331Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2618937Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2619531Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2620127Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2620723Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2621263Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2621760Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2622259Z E1204 14:04:10.367000 
107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2622752Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2623187Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2623498Z E1204 14:04:10.367000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2623798Z [rank2]:I1204 14:04:10.367000 108643 site-packages/torch/testing/_internal/common_distributed.py:891] Process 2 sent traceback 2025-12-04T14:34:49.2624146Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Process 1 timed out with traceback: 2025-12-04T14:34:49.2624453Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2624773Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000078d1c97fc640 (most recent call first): 2025-12-04T14:34:49.2625119Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2625403Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2625724Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000078d1cabfe640 (most recent call first): 2025-12-04T14:34:49.2626068Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2626346Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2626685Z E1204 14:04:10.368000 107915 
site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000078d1cb5ff640 (most recent call first): 2025-12-04T14:34:49.2627036Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2627337Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2627744Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000078d1ca1fd640 (most recent call first): 2025-12-04T14:34:49.2628089Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2628370Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2628692Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000078d26b1ff640 (most recent call first): 2025-12-04T14:34:49.2629110Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2629560Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2630034Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2630524Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2630998Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 
2025-12-04T14:34:49.2631361Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2631696Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x000078d3b9082640 (most recent call first): 2025-12-04T14:34:49.2632214Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2632749Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2633218Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2633695Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2634055Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2634380Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000078f5b8f92740 (most recent call first): 2025-12-04T14:34:49.2634884Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1087 in _compare_regular_values_close 2025-12-04T14:34:49.2635475Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 905 in _compare_values 
2025-12-04T14:34:49.2636053Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 747 in compare 2025-12-04T14:34:49.2636620Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1298 in not_close_error_metas 2025-12-04T14:34:49.2637251Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4245 in assertEqual 2025-12-04T14:34:49.2637880Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1397 in test_2d_fsdp_tp_ac_compile 2025-12-04T14:34:49.2638478Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T14:34:49.2639097Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.2639726Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2640324Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 
2025-12-04T14:34:49.2640909Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2641504Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2642098Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2642632Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2643128Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2643626Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2644121Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2644551Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2644856Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2645163Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 
Process 2 timed out with traceback: 2025-12-04T14:34:49.2645482Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2645802Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000756141ffe640 (most recent call first): 2025-12-04T14:34:49.2646165Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2646462Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2646798Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000756140bfc640 (most recent call first): 2025-12-04T14:34:49.2647149Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2647430Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2647776Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007561429ff640 (most recent call first): 2025-12-04T14:34:49.2648121Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2648401Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2648722Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007561415fd640 (most recent call first): 2025-12-04T14:34:49.2649066Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2649346Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2649668Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007562febfd640 (most 
recent call first): 2025-12-04T14:34:49.2650090Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2650544Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2651017Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2651507Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2651984Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2652347Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2652679Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x0000756330681640 (most recent call first): 2025-12-04T14:34:49.2653193Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2653730Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2654195Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2654689Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2655053Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2655395Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000075852f3a3740 (most recent call first): 2025-12-04T14:34:49.2655924Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1087 in _compare_regular_values_close 2025-12-04T14:34:49.2656516Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 905 in _compare_values 2025-12-04T14:34:49.2657076Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 747 in compare 2025-12-04T14:34:49.2657688Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1298 in not_close_error_metas 2025-12-04T14:34:49.2658283Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4245 in assertEqual 2025-12-04T14:34:49.2658882Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1397 in test_2d_fsdp_tp_ac_compile 2025-12-04T14:34:49.2659482Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T14:34:49.2660100Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.2660729Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2661324Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2661908Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2662501Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2663094Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2663628Z E1204 14:04:10.368000 107915 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2664147Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2664640Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2665149Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2665605Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2665913Z E1204 14:04:10.368000 107915 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2666265Z E1204 14:04:15.370000 107915 site-packages/torch/testing/_internal/common_distributed.py:992] Could not retrieve traceback for timed out process: 3 2025-12-04T14:34:49.2666536Z FAILED [305.0182s] [ 78%] 2025-12-04T14:34:49.2666616Z 2025-12-04T14:34:49.2666679Z =================================== FAILURES =================================== 2025-12-04T14:34:49.2666882Z _________ TestDTensorCompileE2E.test_2d_fsdp_tp_ac_compile_use_ca_True _________ 2025-12-04T14:34:49.2667071Z Traceback (most recent call last): 2025-12-04T14:34:49.2667328Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T14:34:49.2667627Z self._join_processes(fn) 2025-12-04T14:34:49.2667885Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 
2025-12-04T14:34:49.2668156Z self._check_return_codes(fn, elapsed_time) 2025-12-04T14:34:49.2668433Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1084, in _check_return_codes 2025-12-04T14:34:49.2668698Z raise RuntimeError( 2025-12-04T14:34:49.2668869Z RuntimeError: Process 0 terminated or timed out after 305.00919699668884 seconds 2025-12-04T14:34:49.2669085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T14:34:49.2669274Z Timing out after 300 seconds and killing subprocesses. 2025-12-04T14:34:49.2669644Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-20c3b8b1d58be378.xml - 2025-12-04T14:34:49.2670012Z =========================== short test summary info ============================ 2025-12-04T14:34:49.2670382Z FAILED [305.0182s] distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True - RuntimeError: Process 0 terminated or timed out after 305.00919699668884 seconds 2025-12-04T14:34:49.2670744Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T14:34:49.2670925Z ============== 1 failed, 9 passed, 1 skipped in 429.62s (0:07:09) ============== 2025-12-04T14:34:49.2671278Z /opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown 2025-12-04T14:34:49.2671626Z warnings.warn('resource_tracker: There appear to be %d ' 2025-12-04T14:34:49.2671779Z Got exit code 1 2025-12-04T14:34:49.2671888Z Retrying single test... 
2025-12-04T14:34:49.2672183Z Test results will be stored in test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-fb7d841820d58eb8.xml 2025-12-04T14:34:49.2672503Z ============================= test session starts ============================== 2025-12-04T14:34:49.2672725Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T14:34:49.2672922Z cachedir: .pytest_cache 2025-12-04T14:34:49.2673174Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T14:34:49.2673422Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T14:34:49.2673549Z configfile: pytest.ini 2025-12-04T14:34:49.2673786Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T14:34:49.2674084Z collecting ... collected 49 items / 13 deselected / 36 selected 2025-12-04T14:34:49.2674452Z stepcurrent: skipping 10 already run items. 
Running only test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True 2025-12-04T14:34:49.2674759Z Running 1 items in this shard 2025-12-04T14:34:49.2674835Z 2025-12-04T14:34:49.2675139Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True I1204 14:04:22.283000 111356 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 112012 2025-12-04T14:34:49.2675635Z I1204 14:04:22.284000 111356 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 112013 2025-12-04T14:34:49.2675985Z I1204 14:04:22.284000 111356 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 112014 2025-12-04T14:34:49.2676332Z I1204 14:04:22.285000 111356 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 112015 2025-12-04T14:34:49.2676685Z [rank3]:[W1204 14:04:31.589508459 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.2677048Z [rank3]:[W1204 14:04:31.589842002 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2677255Z 2025-12-04T14:34:49.2677418Z [rank3]:[W1204 14:04:39.281358073 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2677659Z 2025-12-04T14:34:49.2677824Z [rank3]:[W1204 14:04:39.299887009 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2678028Z 2025-12-04T14:34:49.2678189Z [rank3]:[W1204 14:04:39.302883782 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2678393Z 2025-12-04T14:34:49.2678557Z [rank3]:[W1204 14:04:39.303053318 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2678763Z 2025-12-04T14:34:49.2678924Z [rank3]:[W1204 14:04:39.303305232 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2679126Z 2025-12-04T14:34:49.2679291Z [rank3]:[W1204 14:04:39.334520405 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2679490Z 2025-12-04T14:34:49.2679658Z [rank3]:[W1204 14:04:39.334808168 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2679856Z 2025-12-04T14:34:49.2680021Z [rank3]:[W1204 14:04:39.335170860 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2680221Z 2025-12-04T14:34:49.2680387Z [rank3]:[W1204 14:04:39.335531762 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2680587Z 2025-12-04T14:34:49.2680754Z [rank3]:[W1204 14:04:39.335643590 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2680955Z 2025-12-04T14:34:49.2681116Z [rank3]:[W1204 14:04:39.335838955 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2681319Z 2025-12-04T14:34:49.2681492Z [rank1]:[W1204 14:06:55.890056909 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.2681846Z [rank1]:[W1204 14:06:55.890120058 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2682065Z 2025-12-04T14:34:49.2682226Z [rank1]:[W1204 14:07:03.170081395 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2682455Z 2025-12-04T14:34:49.2682617Z [rank1]:[W1204 14:07:03.187951177 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2682820Z 2025-12-04T14:34:49.2682981Z [rank1]:[W1204 14:07:03.191055118 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2683184Z 2025-12-04T14:34:49.2683350Z [rank1]:[W1204 14:07:03.191219404 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2683549Z 2025-12-04T14:34:49.2683716Z [rank1]:[W1204 14:07:03.191486938 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2683916Z 2025-12-04T14:34:49.2684084Z [rank1]:[W1204 14:07:03.219054484 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2684284Z 2025-12-04T14:34:49.2684450Z [rank1]:[W1204 14:07:03.219353257 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2684650Z 2025-12-04T14:34:49.2684815Z [rank1]:[W1204 14:07:03.219714689 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2685013Z 2025-12-04T14:34:49.2685178Z [rank1]:[W1204 14:07:03.220100090 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2685381Z 2025-12-04T14:34:49.2685542Z [rank1]:[W1204 14:07:03.220225077 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2685745Z 2025-12-04T14:34:49.2685907Z [rank1]:[W1204 14:07:03.220433903 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2686114Z 2025-12-04T14:34:49.2686326Z [rank1]:W1204 14:07:04.613000 112013 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2686779Z [rank0]:W1204 14:07:57.202000 112012 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2687230Z [rank3]:W1204 14:08:04.523000 112015 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2687719Z [rank2]:W1204 14:08:08.848000 112014 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2688127Z [rank3]:[W1204 14:08:52.664931034 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2688329Z 2025-12-04T14:34:49.2688495Z [rank3]:[W1204 14:08:52.669504432 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2688694Z 2025-12-04T14:34:49.2688860Z [rank3]:[W1204 14:08:52.669852404 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2689059Z 2025-12-04T14:34:49.2689245Z [rank3]:[W1204 14:08:52.670097539 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2689444Z 2025-12-04T14:34:49.2689609Z [rank3]:[W1204 14:08:52.670315754 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2689825Z 2025-12-04T14:34:49.2689999Z [rank3]:[W1204 14:08:52.670521130 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2690217Z 2025-12-04T14:34:49.2690379Z [rank3]:[W1204 14:08:52.670721605 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2690580Z 2025-12-04T14:34:49.2690743Z [rank3]:[W1204 14:08:52.670921031 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2690946Z 2025-12-04T14:34:49.2691107Z [rank3]:[W1204 14:08:52.670987369 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2691309Z 2025-12-04T14:34:49.2691494Z [rank1]:I1204 14:09:22.303000 112013 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 1 2025-12-04T14:34:49.2691904Z [rank2]:I1204 14:09:22.303000 112014 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 2 2025-12-04T14:34:49.2692311Z [rank0]:I1204 14:09:22.303000 112012 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 0 2025-12-04T14:34:49.2692698Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Process 0 timed out with traceback: 2025-12-04T14:34:49.2693008Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2693336Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000717ce75fd640 (most recent call first): 2025-12-04T14:34:49.2693691Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2693975Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2694303Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000717ce6bfc640 (most recent call first): 2025-12-04T14:34:49.2694649Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2694930Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2695254Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000717ce7ffe640 (most recent call first): 2025-12-04T14:34:49.2695598Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 
2025-12-04T14:34:49.2695878Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2696199Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000717ce89ff640 (most recent call first): 2025-12-04T14:34:49.2696653Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 836 in __call__ 2025-12-04T14:34:49.2697226Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 684 in __torch_dispatch__ 2025-12-04T14:34:49.2697734Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2698058Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000717d887ff640 (most recent call first): 2025-12-04T14:34:49.2698481Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2698977Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2699450Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2699939Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2700422Z E1204 14:09:22.305000 111356 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2700784Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2701119Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x0000717ed6782640 (most recent call first): 2025-12-04T14:34:49.2701635Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2702175Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2702642Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2703115Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2703478Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2703803Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000071a0d64f2740 (most recent call first): 2025-12-04T14:34:49.2704284Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.2704846Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.2705372Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.2705931Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1400 in test_2d_fsdp_tp_ac_compile 2025-12-04T14:34:49.2706534Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T14:34:49.2707168Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.2707849Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2708472Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2709069Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2709661Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2710251Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2710793Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2711292Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2711787Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2712280Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2712717Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2713026Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2713336Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Process 1 timed out with traceback: 2025-12-04T14:34:49.2713642Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2713966Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a0eafbfc640 (most recent call first): 2025-12-04T14:34:49.2714315Z E1204 
14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2714597Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2714919Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a0eb05fd640 (most recent call first): 2025-12-04T14:34:49.2715265Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2715548Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2715868Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a0eb0ffe640 (most recent call first): 2025-12-04T14:34:49.2716331Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 836 in __call__ 2025-12-04T14:34:49.2716899Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 684 in __torch_dispatch__ 2025-12-04T14:34:49.2717357Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2717731Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a0eb19ff640 (most recent call first): 2025-12-04T14:34:49.2718095Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2718375Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2718695Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a309f5ff640 (most recent call 
first): 2025-12-04T14:34:49.2719112Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2719565Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2720040Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2720528Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2721008Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2721369Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2721703Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007a107e782640 (most recent call first): 2025-12-04T14:34:49.2722220Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2722754Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2723221Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 
in _bootstrap_inner 2025-12-04T14:34:49.2723697Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2724057Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2724379Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a327f46b740 (most recent call first): 2025-12-04T14:34:49.2724861Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.2725418Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.2725956Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.2726514Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1400 in test_2d_fsdp_tp_ac_compile 2025-12-04T14:34:49.2727147Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T14:34:49.2727809Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 
2025-12-04T14:34:49.2728435Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2729032Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2729619Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2730215Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2730806Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2731346Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2731845Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2732343Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2732840Z E1204 14:09:22.305000 111356 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2733275Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2733583Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2733882Z [rank0]:I1204 14:09:22.305000 112012 site-packages/torch/testing/_internal/common_distributed.py:891] Process 0 sent traceback 2025-12-04T14:34:49.2734223Z [rank2]:I1204 14:09:22.305000 112014 site-packages/torch/testing/_internal/common_distributed.py:891] Process 2 sent traceback 2025-12-04T14:34:49.2734559Z [rank1]:I1204 14:09:22.305000 112013 site-packages/torch/testing/_internal/common_distributed.py:891] Process 1 sent traceback 2025-12-04T14:34:49.2734905Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Process 2 timed out with traceback: 2025-12-04T14:34:49.2735211Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2735551Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c53643fd640 (most recent call first): 2025-12-04T14:34:49.2736000Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 836 in __call__ 2025-12-04T14:34:49.2736612Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 684 in __torch_dispatch__ 2025-12-04T14:34:49.2737071Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2737393Z E1204 
14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c53639fc640 (most recent call first): 2025-12-04T14:34:49.2737784Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2738064Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2738384Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c53657ff640 (most recent call first): 2025-12-04T14:34:49.2738733Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2739017Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2739337Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c5364dfe640 (most recent call first): 2025-12-04T14:34:49.2739683Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2739963Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2740281Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c55217fb640 (most recent call first): 2025-12-04T14:34:49.2740699Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2741157Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2741632Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2742120Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2742597Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2742958Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2743293Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007c6d5cefe640 (most recent call first): 2025-12-04T14:34:49.2743806Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2744339Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2744823Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2745296Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2745682Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2746016Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c775355d740 (most recent call first): 
2025-12-04T14:34:49.2746496Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.2747052Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.2747702Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.2748268Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1400 in test_2d_fsdp_tp_ac_compile 2025-12-04T14:34:49.2748873Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T14:34:49.2749496Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.2750121Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2750714Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2751300Z E1204 
14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2751895Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2752487Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2753024Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2753525Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2754028Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2754539Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2754969Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2755289Z E1204 14:09:22.305000 111356 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2755650Z E1204 14:09:27.310000 111356 site-packages/torch/testing/_internal/common_distributed.py:992] Could not retrieve traceback for 
timed out process: 3 2025-12-04T14:34:49.2755916Z FAILED [305.0313s] [100%] 2025-12-04T14:34:49.2755985Z 2025-12-04T14:34:49.2756049Z =================================== FAILURES =================================== 2025-12-04T14:34:49.2756248Z _________ TestDTensorCompileE2E.test_2d_fsdp_tp_ac_compile_use_ca_True _________ 2025-12-04T14:34:49.2756440Z Traceback (most recent call last): 2025-12-04T14:34:49.2756694Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T14:34:49.2756948Z self._join_processes(fn) 2025-12-04T14:34:49.2757201Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T14:34:49.2757516Z self._check_return_codes(fn, elapsed_time) 2025-12-04T14:34:49.2757797Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1084, in _check_return_codes 2025-12-04T14:34:49.2758061Z raise RuntimeError( 2025-12-04T14:34:49.2758228Z RuntimeError: Process 0 terminated or timed out after 305.0268294811249 seconds 2025-12-04T14:34:49.2758443Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T14:34:49.2758628Z Timing out after 300 seconds and killing subprocesses. 2025-12-04T14:34:49.2758998Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-fb7d841820d58eb8.xml - 2025-12-04T14:34:49.2759359Z =========================== short test summary info ============================ 2025-12-04T14:34:49.2759723Z FAILED [305.0313s] distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True - RuntimeError: Process 0 terminated or timed out after 305.0268294811249 seconds 2025-12-04T14:34:49.2760082Z !!!!!!!!!!!!!!!!!!!!!!!!!! 
stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T14:34:49.2760259Z ================= 1 failed, 13 deselected in 305.04s (0:05:05) ================= 2025-12-04T14:34:49.2760607Z /opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown 2025-12-04T14:34:49.2760951Z warnings.warn('resource_tracker: There appear to be %d ' 2025-12-04T14:34:49.2761101Z Got exit code 1 2025-12-04T14:34:49.2761205Z Retrying single test... 2025-12-04T14:34:49.2761494Z Test results will be stored in test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-60d0b26edbeb4690.xml 2025-12-04T14:34:49.2761810Z ============================= test session starts ============================== 2025-12-04T14:34:49.2762028Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T14:34:49.2762225Z cachedir: .pytest_cache 2025-12-04T14:34:49.2762455Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T14:34:49.2762703Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T14:34:49.2762829Z configfile: pytest.ini 2025-12-04T14:34:49.2763082Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T14:34:49.2763365Z collecting ... collected 49 items / 13 deselected / 36 selected 2025-12-04T14:34:49.2763707Z stepcurrent: skipping 10 already run items. 
Running only test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True 2025-12-04T14:34:49.2764028Z Running 1 items in this shard 2025-12-04T14:34:49.2764108Z 2025-12-04T14:34:49.2764435Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True I1204 14:09:32.967000 114744 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 115400 2025-12-04T14:34:49.2764929Z I1204 14:09:32.968000 114744 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 115401 2025-12-04T14:34:49.2765281Z I1204 14:09:32.969000 114744 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 115402 2025-12-04T14:34:49.2765632Z I1204 14:09:32.969000 114744 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 115403 2025-12-04T14:34:49.2765983Z [rank3]:[W1204 14:09:47.219250601 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.2766345Z [rank3]:[W1204 14:09:47.219306010 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2766556Z 2025-12-04T14:34:49.2766720Z [rank3]:[W1204 14:09:55.019116611 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2766926Z 2025-12-04T14:34:49.2767091Z [rank3]:[W1204 14:09:55.353385075 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2767290Z 2025-12-04T14:34:49.2767462Z [rank3]:[W1204 14:09:55.356763650 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2767704Z 2025-12-04T14:34:49.2767869Z [rank3]:[W1204 14:09:55.356962186 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2768068Z 2025-12-04T14:34:49.2768233Z [rank3]:[W1204 14:09:55.357284789 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2768433Z 2025-12-04T14:34:49.2768601Z [rank3]:[W1204 14:09:55.598333427 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2768798Z 2025-12-04T14:34:49.2768964Z [rank3]:[W1204 14:09:55.598632850 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2769167Z 2025-12-04T14:34:49.2769329Z [rank3]:[W1204 14:09:55.598997702 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2769532Z 2025-12-04T14:34:49.2769693Z [rank3]:[W1204 14:09:55.599422222 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2769898Z 2025-12-04T14:34:49.2770060Z [rank3]:[W1204 14:09:55.599544470 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2770262Z 2025-12-04T14:34:49.2770424Z [rank3]:[W1204 14:09:55.599755985 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2770625Z 2025-12-04T14:34:49.2770779Z [rank1]:[W1204 14:11:01.740420587 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.2771151Z [rank1]:[W1204 14:11:01.740467276 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2771355Z 2025-12-04T14:34:49.2771521Z [rank1]:[W1204 14:11:09.162593543 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2771718Z 2025-12-04T14:34:49.2771901Z [rank1]:[W1204 14:11:09.181664139 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2772117Z 2025-12-04T14:34:49.2772296Z [rank1]:[W1204 14:11:09.184436307 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2772495Z 2025-12-04T14:34:49.2772660Z [rank1]:[W1204 14:11:09.184593154 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2772858Z 2025-12-04T14:34:49.2773025Z [rank1]:[W1204 14:11:09.184840398 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2773227Z 2025-12-04T14:34:49.2773389Z [rank1]:[W1204 14:11:09.212609571 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2773595Z 2025-12-04T14:34:49.2773756Z [rank1]:[W1204 14:11:09.212876265 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2773959Z 2025-12-04T14:34:49.2774122Z [rank1]:[W1204 14:11:09.213235597 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2774324Z 2025-12-04T14:34:49.2774483Z [rank1]:[W1204 14:11:09.213587939 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2774684Z 2025-12-04T14:34:49.2774846Z [rank1]:[W1204 14:11:09.213698977 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2775049Z 2025-12-04T14:34:49.2775214Z [rank1]:[W1204 14:11:09.213906172 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2775413Z 2025-12-04T14:34:49.2775629Z [rank2]:W1204 14:11:10.552000 115402 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2776082Z [rank0]:W1204 14:11:10.567000 115400 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2776530Z [rank3]:W1204 14:11:14.140000 115403 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2776975Z [rank1]:W1204 14:12:12.726000 115401 site-packages/torch/_logging/_internal.py:1204] [0/0] Profiler function will be ignored 2025-12-04T14:34:49.2777379Z [rank1]:[W1204 14:12:15.364584588 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2777618Z 2025-12-04T14:34:49.2777783Z [rank1]:[W1204 14:12:15.369444900 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2777989Z 2025-12-04T14:34:49.2778152Z [rank1]:[W1204 14:12:15.369801862 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2778356Z 2025-12-04T14:34:49.2778518Z [rank1]:[W1204 14:12:15.370047467 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2778721Z 2025-12-04T14:34:49.2778897Z [rank1]:[W1204 14:12:15.370266452 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2779100Z 2025-12-04T14:34:49.2779261Z [rank1]:[W1204 14:12:15.370475447 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2779478Z 2025-12-04T14:34:49.2779639Z [rank1]:[W1204 14:12:15.370684233 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2779857Z 2025-12-04T14:34:49.2780033Z [rank1]:[W1204 14:12:15.370886648 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2780234Z 2025-12-04T14:34:49.2780399Z [rank1]:[W1204 14:12:15.370954787 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2780596Z 2025-12-04T14:34:49.2780764Z [rank0]:[W1204 14:13:33.920612687 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2780962Z 2025-12-04T14:34:49.2781315Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py:972: UserWarning: Unsupported unwinding pattern: Address not in range (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/profiler/unwind/unwind.cpp:219.) 
2025-12-04T14:34:49.2781756Z actual_saved_tensors = value.saved_tensors 2025-12-04T14:34:49.2782016Z [rank0]:[W1204 14:13:41.439753283 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2782220Z 2025-12-04T14:34:49.2782384Z [rank0]:[W1204 14:13:41.440434487 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2782590Z 2025-12-04T14:34:49.2782753Z [rank0]:[W1204 14:13:41.440705011 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2782957Z 2025-12-04T14:34:49.2783119Z [rank0]:[W1204 14:13:41.440938496 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2783321Z 2025-12-04T14:34:49.2783483Z [rank0]:[W1204 14:13:41.441165161 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2783688Z 2025-12-04T14:34:49.2783849Z [rank0]:[W1204 14:13:41.441377877 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2784053Z 2025-12-04T14:34:49.2784219Z [rank0]:[W1204 14:13:41.441582162 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2784419Z 2025-12-04T14:34:49.2784584Z [rank0]:[W1204 14:13:41.441654550 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2784783Z 2025-12-04T14:34:49.2784949Z [rank3]:[W1204 14:14:17.209978427 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2785148Z 2025-12-04T14:34:49.2785315Z [rank3]:[W1204 14:14:17.214833080 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2785514Z 2025-12-04T14:34:49.2785682Z [rank3]:[W1204 14:14:17.215365998 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2785880Z 2025-12-04T14:34:49.2786045Z [rank3]:[W1204 14:14:17.215623462 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2786248Z 2025-12-04T14:34:49.2786419Z [rank3]:[W1204 14:14:17.215842287 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2786623Z 2025-12-04T14:34:49.2786784Z [rank3]:[W1204 14:14:17.216065152 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2787007Z 2025-12-04T14:34:49.2787168Z [rank3]:[W1204 14:14:17.216274688 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2787382Z 2025-12-04T14:34:49.2787602Z [rank3]:[W1204 14:14:17.216475593 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2787805Z 2025-12-04T14:34:49.2787966Z [rank3]:[W1204 14:14:17.216546822 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2788166Z 2025-12-04T14:34:49.2788331Z [rank2]:[W1204 14:14:17.299572729 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2788530Z 2025-12-04T14:34:49.2788695Z [rank0]:[W1204 14:14:17.641372012 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2788896Z 2025-12-04T14:34:49.2789062Z [rank0]:[W1204 14:14:17.651137576 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2789263Z 2025-12-04T14:34:49.2789429Z [rank0]:[W1204 14:14:17.656684882 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2789627Z 2025-12-04T14:34:49.2789792Z [rank0]:[W1204 14:14:17.662311828 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2789990Z 2025-12-04T14:34:49.2790157Z [rank0]:[W1204 14:14:17.667787496 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2790361Z 2025-12-04T14:34:49.2790521Z [rank0]:[W1204 14:14:17.672847674 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2790723Z 2025-12-04T14:34:49.2790884Z [rank0]:[W1204 14:14:17.677929521 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2791086Z 2025-12-04T14:34:49.2791246Z [rank0]:[W1204 14:14:17.683257653 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2791447Z 2025-12-04T14:34:49.2791609Z [rank0]:[W1204 14:14:17.726918124 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2791811Z 2025-12-04T14:34:49.2791974Z [rank0]:[W1204 14:14:17.733659474 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2792178Z 2025-12-04T14:34:49.2792343Z [rank0]:[W1204 14:14:17.738640023 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2792541Z 2025-12-04T14:34:49.2792705Z [rank0]:[W1204 14:14:17.743630723 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2792903Z 2025-12-04T14:34:49.2793069Z [rank0]:[W1204 14:14:17.748574233 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2793269Z 2025-12-04T14:34:49.2793434Z [rank0]:[W1204 14:14:17.753963923 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2793860Z 2025-12-04T14:34:49.2801584Z [rank0]:[W1204 14:14:17.758953883 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2801807Z 2025-12-04T14:34:49.2801978Z [rank0]:[W1204 14:14:17.763952972 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2802201Z 2025-12-04T14:34:49.2802360Z [rank0]:[W1204 14:14:17.824682404 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2802576Z 2025-12-04T14:34:49.2802751Z [rank0]:[W1204 14:14:17.827295216 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2802950Z 2025-12-04T14:34:49.2803107Z [rank0]:[W1204 14:14:17.885966604 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2803304Z 2025-12-04T14:34:49.2803465Z [rank0]:[W1204 14:14:17.886117190 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2803663Z 2025-12-04T14:34:49.2803820Z [rank0]:[W1204 14:14:17.886354375 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2804023Z 2025-12-04T14:34:49.2804179Z [rank0]:[W1204 14:14:17.886415864 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2804380Z 2025-12-04T14:34:49.2804755Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T14:34:49.2805192Z warnings.warn( 2025-12-04T14:34:49.2805419Z [rank3]:[W1204 14:14:18.083705715 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2805619Z 2025-12-04T14:34:49.2805780Z [rank3]:[W1204 14:14:18.093458968 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2805979Z 2025-12-04T14:34:49.2806139Z [rank3]:[W1204 14:14:18.099435306 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2806336Z 2025-12-04T14:34:49.2806498Z [rank3]:[W1204 14:14:18.105572409 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2806696Z 2025-12-04T14:34:49.2806854Z [rank3]:[W1204 14:14:18.111945818 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2807052Z 2025-12-04T14:34:49.2807215Z [rank3]:[W1204 14:14:18.117501505 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2807411Z 2025-12-04T14:34:49.2807630Z [rank3]:[W1204 14:14:18.123126440 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2807828Z 2025-12-04T14:34:49.2807987Z [rank3]:[W1204 14:14:18.129044938 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2808185Z 2025-12-04T14:34:49.2808346Z [rank3]:[W1204 14:14:18.172300648 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2808545Z 2025-12-04T14:34:49.2808704Z [rank3]:[W1204 14:14:18.178969560 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2808900Z 2025-12-04T14:34:49.2809075Z [rank3]:[W1204 14:14:18.183986109 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2809271Z 2025-12-04T14:34:49.2809431Z [rank3]:[W1204 14:14:18.188991838 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2809640Z 2025-12-04T14:34:49.2809800Z [rank3]:[W1204 14:14:18.193951558 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2810011Z 2025-12-04T14:34:49.2810187Z [rank3]:[W1204 14:14:18.199312249 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2810383Z 2025-12-04T14:34:49.2810542Z [rank3]:[W1204 14:14:18.204219870 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2810741Z 2025-12-04T14:34:49.2810898Z [rank3]:[W1204 14:14:18.210438502 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2811095Z 2025-12-04T14:34:49.2811253Z [rank3]:[W1204 14:14:18.267322029 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2811450Z 2025-12-04T14:34:49.2811608Z [rank3]:[W1204 14:14:18.269640988 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2811806Z 2025-12-04T14:34:49.2811963Z [rank3]:[W1204 14:14:18.327455085 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2812160Z 2025-12-04T14:34:49.2812317Z [rank3]:[W1204 14:14:18.327539973 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2812513Z 2025-12-04T14:34:49.2812675Z [rank3]:[W1204 14:14:18.327743228 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2812871Z 2025-12-04T14:34:49.2813030Z [rank3]:[W1204 14:14:18.327804877 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2813224Z 2025-12-04T14:34:49.2813586Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T14:34:49.2814017Z warnings.warn( 2025-12-04T14:34:49.2814238Z [rank1]:[W1204 14:14:18.920795625 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2814434Z 2025-12-04T14:34:49.2814595Z [rank1]:[W1204 14:14:18.930723955 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2814792Z 2025-12-04T14:34:49.2814951Z [rank1]:[W1204 14:14:18.936290941 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2815149Z 2025-12-04T14:34:49.2815307Z [rank1]:[W1204 14:14:19.941875937 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2815504Z 2025-12-04T14:34:49.2815663Z [rank1]:[W1204 14:14:19.947385165 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2815857Z 2025-12-04T14:34:49.2816015Z [rank1]:[W1204 14:14:19.952457883 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2816212Z 2025-12-04T14:34:49.2816382Z [rank1]:[W1204 14:14:19.957544170 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2816581Z 2025-12-04T14:34:49.2816741Z [rank1]:[W1204 14:14:19.962506960 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2816947Z 2025-12-04T14:34:49.2817107Z [rank1]:[W1204 14:14:19.004617645 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2817313Z 2025-12-04T14:34:49.2817516Z [rank1]:[W1204 14:14:19.011343406 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2817712Z 2025-12-04T14:34:49.2817871Z [rank1]:[W1204 14:14:19.016312365 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2818065Z 2025-12-04T14:34:49.2818226Z [rank1]:[W1204 14:14:19.021306845 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2818423Z 2025-12-04T14:34:49.2818585Z [rank1]:[W1204 14:14:19.026311773 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2818785Z 2025-12-04T14:34:49.2818945Z [rank1]:[W1204 14:14:19.031705114 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2819141Z 2025-12-04T14:34:49.2819300Z [rank1]:[W1204 14:14:19.036681043 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2819496Z 2025-12-04T14:34:49.2819654Z [rank1]:[W1204 14:14:19.041676972 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2819851Z 2025-12-04T14:34:49.2820010Z [rank1]:[W1204 14:14:19.098486072 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2820209Z 2025-12-04T14:34:49.2820366Z [rank1]:[W1204 14:14:19.100694423 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2820564Z 2025-12-04T14:34:49.2820723Z [rank1]:[W1204 14:14:19.159074467 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2820919Z 2025-12-04T14:34:49.2821080Z [rank1]:[W1204 14:14:19.159169545 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2821274Z 2025-12-04T14:34:49.2821436Z [rank1]:[W1204 14:14:19.159387830 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2821631Z 2025-12-04T14:34:49.2821789Z [rank1]:[W1204 14:14:19.159455908 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2821985Z 2025-12-04T14:34:49.2822341Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T14:34:49.2822770Z warnings.warn( 2025-12-04T14:34:49.2823172Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py:972: UserWarning: Unsupported unwinding pattern: Address not in range (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/profiler/unwind/unwind.cpp:219.) 2025-12-04T14:34:49.2823596Z actual_saved_tensors = value.saved_tensors 2025-12-04T14:34:49.2823847Z [rank2]:[W1204 14:14:25.165247972 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2824082Z 2025-12-04T14:34:49.2824241Z [rank2]:[W1204 14:14:25.165988816 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2824441Z 2025-12-04T14:34:49.2824599Z [rank2]:[W1204 14:14:25.166264949 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2824813Z 2025-12-04T14:34:49.2825009Z [rank2]:[W1204 14:14:25.166503944 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2825206Z 2025-12-04T14:34:49.2825364Z [rank2]:[W1204 14:14:25.166723889 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2825562Z 2025-12-04T14:34:49.2825722Z [rank2]:[W1204 14:14:25.166936454 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2825919Z 2025-12-04T14:34:49.2826077Z [rank2]:[W1204 14:14:25.167153800 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2826271Z 2025-12-04T14:34:49.2826432Z [rank2]:[W1204 14:14:25.167228428 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2826629Z 2025-12-04T14:34:49.2826789Z [rank2]:[W1204 14:14:27.582929804 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2826982Z 2025-12-04T14:34:49.2827141Z [rank2]:[W1204 14:14:27.592550340 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2827336Z 2025-12-04T14:34:49.2827532Z [rank2]:[W1204 14:14:27.598100177 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2827730Z 2025-12-04T14:34:49.2827888Z [rank2]:[W1204 14:14:27.603713593 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2828084Z 2025-12-04T14:34:49.2828242Z [rank2]:[W1204 14:14:27.609242300 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2828438Z 2025-12-04T14:34:49.2828596Z [rank2]:[W1204 14:14:27.614287518 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2828796Z 2025-12-04T14:34:49.2828954Z [rank2]:[W1204 14:14:27.619364345 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2829150Z 2025-12-04T14:34:49.2829309Z [rank2]:[W1204 14:14:27.624297636 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2829506Z 2025-12-04T14:34:49.2829664Z [rank2]:[W1204 14:14:27.666443120 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2829858Z 2025-12-04T14:34:49.2830020Z [rank2]:[W1204 14:14:27.673176121 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2830215Z 2025-12-04T14:34:49.2830375Z [rank2]:[W1204 14:14:27.678187420 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2830568Z 2025-12-04T14:34:49.2830727Z [rank2]:[W1204 14:14:27.683197129 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2830921Z 2025-12-04T14:34:49.2831105Z [rank2]:[W1204 14:14:27.688142949 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2831305Z 2025-12-04T14:34:49.2831464Z [rank2]:[W1204 14:14:27.693575048 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2831660Z 2025-12-04T14:34:49.2831831Z [rank2]:[W1204 14:14:27.698550198 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2832042Z 2025-12-04T14:34:49.2832221Z [rank2]:[W1204 14:14:27.703583396 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2832418Z 2025-12-04T14:34:49.2832577Z [rank2]:[W1204 14:14:27.761575949 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2832772Z 2025-12-04T14:34:49.2832932Z [rank2]:[W1204 14:14:27.763965416 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2833130Z 2025-12-04T14:34:49.2833289Z [rank2]:[W1204 14:14:27.822353820 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2833483Z 2025-12-04T14:34:49.2833645Z [rank2]:[W1204 14:14:27.822444818 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.2833840Z 2025-12-04T14:34:49.2834000Z [rank2]:[W1204 14:14:27.822667173 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2834194Z 2025-12-04T14:34:49.2834354Z [rank2]:[W1204 14:14:27.822728152 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.2834550Z 2025-12-04T14:34:49.2834909Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T14:34:49.2835333Z warnings.warn( 2025-12-04T14:34:49.2835569Z [rank3]:I1204 14:14:32.973000 115403 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 3 2025-12-04T14:34:49.2835967Z [rank0]:I1204 14:14:32.973000 115400 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 0 2025-12-04T14:34:49.2836358Z [rank1]:I1204 14:14:32.973000 115401 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 1 2025-12-04T14:34:49.2836746Z [rank2]:I1204 14:14:32.973000 115402 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 2 2025-12-04T14:34:49.2837109Z [rank3]:I1204 14:14:32.974000 115403 site-packages/torch/testing/_internal/common_distributed.py:891] Process 3 sent traceback 2025-12-04T14:34:49.2837443Z [rank0]:I1204 14:14:32.974000 115400 site-packages/torch/testing/_internal/common_distributed.py:891] Process 0 sent traceback 2025-12-04T14:34:49.2837827Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Process 0 timed out with traceback: 
2025-12-04T14:34:49.2838129Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2838448Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000076263ddff640 (most recent call first): 2025-12-04T14:34:49.2838938Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.2839554Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.2840074Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2840565Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2841048Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2841403Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2841718Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007604309fd640 (most recent call first): 2025-12-04T14:34:49.2842054Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2842329Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2842643Z 
E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000760431dff640 (most recent call first): 2025-12-04T14:34:49.2842979Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2843251Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2843562Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000076042fffc640 (most recent call first): 2025-12-04T14:34:49.2843900Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2844170Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2844479Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007604313fe640 (most recent call first): 2025-12-04T14:34:49.2844814Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2845086Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2845394Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007604d13ff640 (most recent call first): 2025-12-04T14:34:49.2845799Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2846241Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2846702Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2847190Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2847704Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2848056Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2848397Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x0000761e291fe640 (most recent call first): 2025-12-04T14:34:49.2848903Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2849457Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2849928Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2850391Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2850742Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2851053Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000076281f25c740 (most recent call first): 
2025-12-04T14:34:49.2851531Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4958 in barrier 2025-12-04T14:34:49.2852098Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T14:34:49.2852696Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 465 in destroy_pg 2025-12-04T14:34:49.2853330Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 541 in wrapper 2025-12-04T14:34:49.2853946Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2854527Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2855104Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2855685Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 
925 in run_test 2025-12-04T14:34:49.2856268Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2856794Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2857282Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2857836Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2858321Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2858762Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2859087Z E1204 14:14:32.974000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2859376Z [rank1]:I1204 14:14:32.974000 115401 site-packages/torch/testing/_internal/common_distributed.py:891] Process 1 sent traceback 2025-12-04T14:34:49.2859703Z [rank2]:I1204 14:14:32.975000 115402 site-packages/torch/testing/_internal/common_distributed.py:891] Process 2 sent traceback 2025-12-04T14:34:49.2860041Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Process 1 timed out with traceback: 2025-12-04T14:34:49.2860335Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 
2025-12-04T14:34:49.2860646Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079c8dc7ff640 (most recent call first): 2025-12-04T14:34:49.2861135Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.2861729Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.2862248Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2862703Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2863166Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2863521Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2863833Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079c91bffd640 (most recent call first): 2025-12-04T14:34:49.2864168Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2864438Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2864747Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079c91c9fe640 (most 
recent call first): 2025-12-04T14:34:49.2865082Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2865353Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2865664Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079c91b5fc640 (most recent call first): 2025-12-04T14:34:49.2866005Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2866279Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2866605Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079c91d3ff640 (most recent call first): 2025-12-04T14:34:49.2866946Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2867220Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2867607Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079eb0b9ff640 (most recent call first): 2025-12-04T14:34:49.2868051Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2868498Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2868962Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2869443Z E1204 14:14:32.975000 114744 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2869914Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2870274Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2870601Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x000079e2f4dfe640 (most recent call first): 2025-12-04T14:34:49.2871113Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2871645Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2872106Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2872573Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2872928Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2873243Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079eceba6b740 (most recent call first): 2025-12-04T14:34:49.2873721Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4958 in barrier 2025-12-04T14:34:49.2874286Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T14:34:49.2874888Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 465 in destroy_pg 2025-12-04T14:34:49.2875528Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 541 in wrapper 2025-12-04T14:34:49.2876159Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2876746Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2877361Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2877988Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2878572Z E1204 14:14:32.975000 114744 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2879100Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2879593Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2880083Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2880571Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2880996Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2881298Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2881598Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Process 2 timed out with traceback: 2025-12-04T14:34:49.2881901Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2882217Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007def61fff640 (most recent call first): 2025-12-04T14:34:49.2882703Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.2883300Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.2883826Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2884289Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2884754Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2885126Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2885445Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dcd55ffd640 (most recent call first): 2025-12-04T14:34:49.2885786Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2886074Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2886420Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dcd573ff640 (most recent call first): 2025-12-04T14:34:49.2886762Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2887036Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2887349Z E1204 
14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dcd555fc640 (most recent call first): 2025-12-04T14:34:49.2887793Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2888071Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2888387Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dcd569fe640 (most recent call first): 2025-12-04T14:34:49.2888726Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2889002Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2889315Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dcf135fe640 (most recent call first): 2025-12-04T14:34:49.2889724Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2890166Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2890632Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2891116Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2891584Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 
973 in _bootstrap 2025-12-04T14:34:49.2891940Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2892265Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007dcf44d81640 (most recent call first): 2025-12-04T14:34:49.2892771Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2893302Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2893761Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2894261Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2894615Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2894932Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007df144c11740 (most recent call first): 2025-12-04T14:34:49.2895452Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4958 in barrier 2025-12-04T14:34:49.2896017Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 
2025-12-04T14:34:49.2896617Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 465 in destroy_pg 2025-12-04T14:34:49.2897256Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 541 in wrapper 2025-12-04T14:34:49.2897916Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2898503Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2899079Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2899666Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2900251Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2900782Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 
2025-12-04T14:34:49.2901270Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2901760Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2902251Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2902674Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2902974Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2903273Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Process 3 timed out with traceback: 2025-12-04T14:34:49.2903588Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2903903Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000702bf8fff640 (most recent call first): 2025-12-04T14:34:49.2904390Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.2905033Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.2905562Z E1204 14:14:32.975000 114744 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2906022Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2906492Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2906851Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2907168Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000702c1d3fc640 (most recent call first): 2025-12-04T14:34:49.2907543Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2907816Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2908131Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000702c1f1ff640 (most recent call first): 2025-12-04T14:34:49.2908469Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2908746Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2909059Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000702c1ddfd640 (most recent call first): 2025-12-04T14:34:49.2909399Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2909672Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2909984Z E1204 14:14:32.975000 
114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000702c1e7fe640 (most recent call first): 2025-12-04T14:34:49.2910321Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2910594Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2910906Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000704e09bff640 (most recent call first): 2025-12-04T14:34:49.2911323Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2911768Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2912232Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2912727Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2913196Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2913576Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2913929Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007045f55fe640 (most recent call first): 2025-12-04T14:34:49.2914434Z E1204 14:14:32.975000 114744 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2914957Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2915418Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2915885Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2916240Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2916553Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000704febce7740 (most recent call first): 2025-12-04T14:34:49.2917029Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4958 in barrier 2025-12-04T14:34:49.2917629Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T14:34:49.2918226Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 465 in destroy_pg 2025-12-04T14:34:49.2918866Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 541 in wrapper 2025-12-04T14:34:49.2919481Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2920066Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2920646Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2921242Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2921841Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2922369Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2922859Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2923392Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2923879Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2924301Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2924603Z E1204 14:14:32.975000 114744 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2924808Z Exception in thread Thread-1 (_event_listener): 2025-12-04T14:34:49.2924958Z Exception in thread Thread-1 (_event_listener): 2025-12-04T14:34:49.2925101Z Traceback (most recent call last): 2025-12-04T14:34:49.2925236Z Exception in thread Thread-1 (_event_listener): 2025-12-04T14:34:49.2925380Z Exception in thread Thread-1 (_event_listener): 2025-12-04T14:34:49.2925515Z Traceback (most recent call last): 2025-12-04T14:34:49.2925636Z Traceback (most recent call last): 2025-12-04T14:34:49.2925756Z Traceback (most recent call last): 2025-12-04T14:34:49.2925940Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T14:34:49.2926188Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T14:34:49.2926431Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T14:34:49.2926670Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T14:34:49.2926848Z self.run() 2025-12-04T14:34:49.2926989Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T14:34:49.2927157Z FAILED [300.0109s] [100%] 2025-12-04T14:34:49.2927225Z 2025-12-04T14:34:49.2927286Z =================================== FAILURES =================================== 2025-12-04T14:34:49.2927525Z _________ 
TestDTensorCompileE2E.test_2d_fsdp_tp_ac_compile_use_ca_True _________ 2025-12-04T14:34:49.2927703Z Traceback (most recent call last): 2025-12-04T14:34:49.2927944Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T14:34:49.2928190Z self._join_processes(fn) 2025-12-04T14:34:49.2928437Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T14:34:49.2928701Z self._check_return_codes(fn, elapsed_time) 2025-12-04T14:34:49.2928966Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1084, in _check_return_codes 2025-12-04T14:34:49.2929226Z raise RuntimeError( 2025-12-04T14:34:49.2929386Z RuntimeError: Process 0 terminated or timed out after 300.0061843395233 seconds 2025-12-04T14:34:49.2929596Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T14:34:49.2929775Z Timing out after 300 seconds and killing subprocesses. 2025-12-04T14:34:49.2930137Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-60d0b26edbeb4690.xml - 2025-12-04T14:34:49.2930494Z =========================== short test summary info ============================ 2025-12-04T14:34:49.2930874Z FAILED [300.0109s] distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True - RuntimeError: Process 0 terminated or timed out after 300.0061843395233 seconds 2025-12-04T14:34:49.2931231Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 
2025-12-04T14:34:49.2931417Z ================= 1 failed, 13 deselected in 300.02s (0:05:00) ================= 2025-12-04T14:34:49.2931785Z /opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown 2025-12-04T14:34:49.2932125Z warnings.warn('resource_tracker: There appear to be %d ' 2025-12-04T14:34:49.2932273Z Got exit code 1 2025-12-04T14:34:49.2932506Z FAILED CONSISTENTLY: test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True 2025-12-04T14:34:49.2932847Z Test failed consistently, continuing with the rest of the tests due to continue-through-error being set 2025-12-04T14:34:49.2933232Z Test results will be stored in test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-1a7c32fb512d541b.xml 2025-12-04T14:34:49.2933542Z ============================= test session starts ============================== 2025-12-04T14:34:49.2933759Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T14:34:49.2933952Z cachedir: .pytest_cache 2025-12-04T14:34:49.2934180Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T14:34:49.2934423Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T14:34:49.2934544Z configfile: pytest.ini 2025-12-04T14:34:49.2934776Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T14:34:49.2935050Z collecting ... collected 49 items / 11 deselected / 38 selected 2025-12-04T14:34:49.2935215Z stepcurrent: skipping 11 already run items. 
2025-12-04T14:34:49.2935352Z Running 3 items in this shard 2025-12-04T14:34:49.2935429Z 2025-12-04T14:34:49.2935738Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_compile_embedding_redistribute I1204 14:14:38.858000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 120123 2025-12-04T14:34:49.2936241Z I1204 14:14:38.858000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 120124 2025-12-04T14:34:49.2936590Z I1204 14:14:38.859000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 120125 2025-12-04T14:34:49.2936935Z I1204 14:14:38.859000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 120126 2025-12-04T14:34:49.2937168Z PASSED [144.8957s] [ 33%] 2025-12-04T14:34:49.2937608Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_False I1204 14:17:03.756000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 123624 2025-12-04T14:34:49.2938126Z I1204 14:17:03.757000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 123625 2025-12-04T14:34:49.2938477Z I1204 14:17:03.757000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 123626 2025-12-04T14:34:49.2938817Z I1204 14:17:03.757000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 123627 2025-12-04T14:34:49.2939047Z PASSED [207.5886s] [ 66%] 2025-12-04T14:34:49.2939455Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True I1204 14:20:31.346000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 126317 2025-12-04T14:34:49.2939976Z 
I1204 14:20:31.346000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 126318 2025-12-04T14:34:49.2940317Z I1204 14:20:31.347000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 126319 2025-12-04T14:34:49.2940671Z I1204 14:20:31.347000 119467 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 126320 2025-12-04T14:34:49.2941251Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T14:34:49.2941688Z warnings.warn( 2025-12-04T14:34:49.2942107Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T14:34:49.2942537Z warnings.warn( 2025-12-04T14:34:49.2942945Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T14:34:49.2943378Z warnings.warn( 2025-12-04T14:34:49.2943619Z [rank3]:I1204 14:25:31.366000 126320 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 3 2025-12-04T14:34:49.2944017Z [rank1]:I1204 14:25:31.366000 126318 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 1 2025-12-04T14:34:49.2944412Z [rank2]:I1204 14:25:31.366000 126319 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 2 2025-12-04T14:34:49.2944779Z [rank3]:I1204 14:25:31.367000 126320 site-packages/torch/testing/_internal/common_distributed.py:891] Process 3 sent traceback 2025-12-04T14:34:49.2945115Z [rank1]:I1204 14:25:31.367000 126318 site-packages/torch/testing/_internal/common_distributed.py:891] Process 1 sent traceback 2025-12-04T14:34:49.2945452Z [rank2]:I1204 14:25:31.367000 126319 site-packages/torch/testing/_internal/common_distributed.py:891] Process 2 sent traceback 2025-12-04T14:34:49.2945816Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:992] Could not retrieve traceback for timed out process: 0 2025-12-04T14:34:49.2946189Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Process 1 timed out with traceback: 2025-12-04T14:34:49.2946493Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2946814Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079d59bfff640 (most recent call first): 2025-12-04T14:34:49.2947309Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.2947958Z E1204 14:25:36.371000 119467 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.2948485Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2948964Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2949437Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2949795Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2950128Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079d5ab3fc640 (most recent call first): 2025-12-04T14:34:49.2950499Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2950777Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2951094Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079d5abdfd640 (most recent call first): 2025-12-04T14:34:49.2951435Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2951711Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2952030Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079d5ac7fe640 (most recent call first): 2025-12-04T14:34:49.2952370Z E1204 
14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2952649Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2952964Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079d5ad1ff640 (most recent call first): 2025-12-04T14:34:49.2953304Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2953583Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2953899Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079d728dff640 (most recent call first): 2025-12-04T14:34:49.2954313Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2954765Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2955234Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2955718Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2956193Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2956553Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2956891Z E1204 
14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x000079ef5cdfe640 (most recent call first): 2025-12-04T14:34:49.2957403Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2957999Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2958464Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2958933Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2959338Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2959653Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079f952a16740 (most recent call first): 2025-12-04T14:34:49.2960111Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1108 in synchronize 2025-12-04T14:34:49.2960678Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 370 in benchmark_gpu 2025-12-04T14:34:49.2961260Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.2961845Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 200 in benchmark 2025-12-04T14:34:49.2962418Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.2962992Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 932 in bench 2025-12-04T14:34:49.2963578Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1072 in 2025-12-04T14:34:49.2964196Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1071 in benchmark_all_configs 2025-12-04T14:34:49.2964825Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1109 in autotune_to_one_config 2025-12-04T14:34:49.2965432Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1379 in run 2025-12-04T14:34:49.2966013Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/tmp/torchinductor_jenkins/h7/ch76wyw2mq5t2fssxx7yoa5ddi2e3se3tenlk5embmwkuyq5rnwq.py", line 264 in call 2025-12-04T14:34:49.2966566Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 3247 in run 2025-12-04T14:34:49.2967105Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 627 in __call__ 2025-12-04T14:34:49.2967746Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729 in inner_fn 2025-12-04T14:34:49.2968346Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 695 in inner_fn 2025-12-04T14:34:49.2968971Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531 in wrapper 2025-12-04T14:34:49.2969598Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134 in call_func_at_runtime_with_args 2025-12-04T14:34:49.2970226Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357 in runtime_wrapper 2025-12-04T14:34:49.2970812Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1135 in forward 2025-12-04T14:34:49.2971354Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.2971884Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 4652 in wrapper 2025-12-04T14:34:49.2972409Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.2972858Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File ".4", line 4 in forward 2025-12-04T14:34:49.2973315Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.2973880Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.2974438Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 442 in __call__ 2025-12-04T14:34:49.2974974Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 936 in call_wrapped 2025-12-04T14:34:49.2975522Z E1204 14:25:36.371000 119467 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926 in compile_wrapper 2025-12-04T14:34:49.2976076Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.2976636Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.2977205Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441 in __call__ 2025-12-04T14:34:49.2977803Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/compiled_autograd.py", line 1132 in runtime_wrapper 2025-12-04T14:34:49.2978405Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.2978979Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.2979500Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.2980053Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1297 in test_tp_compile_fullgraph 2025-12-04T14:34:49.2980671Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.2981292Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.2981881Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.2982462Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.2983047Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.2983636Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.2984168Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.2984662Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.2985150Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.2985638Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.2986062Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.2986365Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2986683Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Process 2 timed out with traceback: 2025-12-04T14:34:49.2986991Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2987310Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000749d45dff640 (most recent call first): 2025-12-04T14:34:49.2987856Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.2988469Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.2988993Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2989456Z E1204 
14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2989925Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2990287Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2990604Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000749d4e94f640 (most recent call first): 2025-12-04T14:34:49.2990946Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2991225Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2991540Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000749d4f350640 (most recent call first): 2025-12-04T14:34:49.2991880Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2992157Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2992472Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000749d4fd51640 (most recent call first): 2025-12-04T14:34:49.2992811Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2993086Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2993516Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000749d50752640 (most recent call first): 2025-12-04T14:34:49.2993856Z E1204 14:25:36.371000 
119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2994133Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2994448Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000749ecc7ff640 (most recent call first): 2025-12-04T14:34:49.2994863Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.2995308Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.2995794Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.2996276Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2996745Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2997148Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.2997510Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x0000749ef6782640 (most recent call first): 2025-12-04T14:34:49.2998018Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.2998546Z 
E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.2999006Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.2999482Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.2999837Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3000153Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000074c0f632f740 (most recent call first): 2025-12-04T14:34:49.3000615Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1108 in synchronize 2025-12-04T14:34:49.3001178Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 370 in benchmark_gpu 2025-12-04T14:34:49.3001761Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.3002340Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 200 in benchmark 2025-12-04T14:34:49.3002913Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.3003485Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 932 in bench 2025-12-04T14:34:49.3004076Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1072 in 2025-12-04T14:34:49.3004691Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1071 in benchmark_all_configs 2025-12-04T14:34:49.3005344Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1109 in autotune_to_one_config 2025-12-04T14:34:49.3005951Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1379 in run 2025-12-04T14:34:49.3006567Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/tmp/torchinductor_jenkins/3a/c3auzylvni5t4ai7vjpwyk5ellly4v3zrypuyqgytg7htsh3wq2t.py", line 264 in call 2025-12-04T14:34:49.3007135Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 3247 in run 2025-12-04T14:34:49.3007708Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 627 in __call__ 2025-12-04T14:34:49.3008282Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729 in inner_fn 2025-12-04T14:34:49.3008891Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 695 in inner_fn 2025-12-04T14:34:49.3009495Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531 in wrapper 2025-12-04T14:34:49.3010115Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134 in call_func_at_runtime_with_args 2025-12-04T14:34:49.3010745Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357 in runtime_wrapper 2025-12-04T14:34:49.3011334Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1135 in forward 2025-12-04T14:34:49.3011875Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3012406Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 4652 in wrapper 2025-12-04T14:34:49.3012932Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3013379Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File ".4", line 4 in forward 2025-12-04T14:34:49.3013643Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3013912Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3014180Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 442 in __call__ 2025-12-04T14:34:49.3014433Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 936 in call_wrapped 2025-12-04T14:34:49.3014727Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926 in compile_wrapper 2025-12-04T14:34:49.3014996Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3015273Z E1204 14:25:36.371000 119467 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3015527Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441 in __call__ 2025-12-04T14:34:49.3015801Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/compiled_autograd.py", line 1132 in runtime_wrapper 2025-12-04T14:34:49.3016070Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.3016319Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.3016559Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.3016845Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1297 in test_tp_compile_fullgraph 2025-12-04T14:34:49.3017149Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.3017435Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.3017733Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.3018013Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.3018298Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.3018569Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.3018814Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.3019048Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.3019300Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.3019540Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 
2025-12-04T14:34:49.3019703Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.3019818Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3019977Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Process 3 timed out with traceback: 2025-12-04T14:34:49.3020091Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3020266Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dc989e3f640 (most recent call first): 2025-12-04T14:34:49.3020549Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.3020833Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.3021043Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3021263Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3021482Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3021594Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3021765Z 
E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dc9af350640 (most recent call first): 2025-12-04T14:34:49.3021903Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3022009Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3022184Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dc9b0752640 (most recent call first): 2025-12-04T14:34:49.3022317Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3022430Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3022601Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dc9afd51640 (most recent call first): 2025-12-04T14:34:49.3022823Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3022947Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3023122Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dc9ae94f640 (most recent call first): 2025-12-04T14:34:49.3023257Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3023373Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3023571Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dcb20dff640 (most recent call first): 2025-12-04T14:34:49.3023776Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.3023986Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.3024212Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.3024436Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3024651Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3024764Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3024951Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007de354afe640 (most recent call first): 2025-12-04T14:34:49.3025239Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.3025448Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3025668Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3025883Z E1204 14:25:36.371000 119467 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3025990Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3026169Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007ded4b6de740 (most recent call first): 2025-12-04T14:34:49.3026419Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1108 in synchronize 2025-12-04T14:34:49.3026702Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 370 in benchmark_gpu 2025-12-04T14:34:49.3026976Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.3027256Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 200 in benchmark 2025-12-04T14:34:49.3027552Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.3027849Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 932 in bench 2025-12-04T14:34:49.3028147Z E1204 14:25:36.371000 119467 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1072 in 2025-12-04T14:34:49.3028449Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1071 in benchmark_all_configs 2025-12-04T14:34:49.3028746Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1109 in autotune_to_one_config 2025-12-04T14:34:49.3029021Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1379 in run 2025-12-04T14:34:49.3029298Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/tmp/torchinductor_jenkins/oh/cohe3keh75xzmw7pxqx7fggqplmf3vfe3eb3jf6loh7m75hatm3v.py", line 264 in call 2025-12-04T14:34:49.3029543Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 3247 in run 2025-12-04T14:34:49.3029802Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 627 in __call__ 2025-12-04T14:34:49.3030083Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729 in inner_fn 2025-12-04T14:34:49.3030370Z E1204 14:25:36.371000 119467 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 695 in inner_fn 2025-12-04T14:34:49.3030649Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531 in wrapper 2025-12-04T14:34:49.3030949Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134 in call_func_at_runtime_with_args 2025-12-04T14:34:49.3031244Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357 in runtime_wrapper 2025-12-04T14:34:49.3031502Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1135 in forward 2025-12-04T14:34:49.3031752Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3032008Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 4652 in wrapper 2025-12-04T14:34:49.3032256Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3032453Z E1204 14:25:36.371000 119467 
site-packages/torch/testing/_internal/common_distributed.py:988] File ".4", line 4 in forward 2025-12-04T14:34:49.3032713Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3032989Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3033235Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 442 in __call__ 2025-12-04T14:34:49.3033491Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 936 in call_wrapped 2025-12-04T14:34:49.3033752Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926 in compile_wrapper 2025-12-04T14:34:49.3034011Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3034285Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3034536Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441 in __call__ 
2025-12-04T14:34:49.3034816Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/compiled_autograd.py", line 1132 in runtime_wrapper 2025-12-04T14:34:49.3035081Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.3035334Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.3035569Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.3035858Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1297 in test_tp_compile_fullgraph 2025-12-04T14:34:49.3036164Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.3036454Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.3036731Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.3037028Z E1204 
14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.3037323Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.3037641Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.3037866Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.3038107Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.3038330Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.3038565Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.3038724Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.3038837Z E1204 14:25:36.371000 119467 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3038883Z FAILED [305.0281s] [100%] 2025-12-04T14:34:49.3038890Z 2025-12-04T14:34:49.3038950Z =================================== FAILURES 
=================================== 2025-12-04T14:34:49.3039070Z _ TestDTensorCompileE2E.test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True _ 2025-12-04T14:34:49.3039121Z Traceback (most recent call last): 2025-12-04T14:34:49.3039293Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T14:34:49.3039340Z self._join_processes(fn) 2025-12-04T14:34:49.3039518Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T14:34:49.3039575Z self._check_return_codes(fn, elapsed_time) 2025-12-04T14:34:49.3039759Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1084, in _check_return_codes 2025-12-04T14:34:49.3039804Z raise RuntimeError( 2025-12-04T14:34:49.3039908Z RuntimeError: Process 0 terminated or timed out after 305.0242292881012 seconds 2025-12-04T14:34:49.3039986Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T14:34:49.3040061Z Timing out after 300 seconds and killing subprocesses. 2025-12-04T14:34:49.3040323Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-1a7c32fb512d541b.xml - 2025-12-04T14:34:49.3040388Z =========================== short test summary info ============================ 2025-12-04T14:34:49.3040691Z FAILED [305.0281s] distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True - RuntimeError: Process 0 terminated or timed out after 305.0242292881012 seconds 2025-12-04T14:34:49.3040758Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 
2025-12-04T14:34:49.3040833Z ============ 1 failed, 2 passed, 11 deselected in 657.53s (0:10:57) ============ 2025-12-04T14:34:49.3041101Z /opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown 2025-12-04T14:34:49.3041191Z warnings.warn('resource_tracker: There appear to be %d ' 2025-12-04T14:34:49.3041232Z Got exit code 1 2025-12-04T14:34:49.3041280Z Retrying single test... 2025-12-04T14:34:49.3041499Z Test results will be stored in test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-2e848ad2e669fc95.xml 2025-12-04T14:34:49.3041566Z ============================= test session starts ============================== 2025-12-04T14:34:49.3041682Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T14:34:49.3041730Z cachedir: .pytest_cache 2025-12-04T14:34:49.3041891Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T14:34:49.3041949Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T14:34:49.3041995Z configfile: pytest.ini 2025-12-04T14:34:49.3042167Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T14:34:49.3042246Z collecting ... collected 49 items / 13 deselected / 36 selected 2025-12-04T14:34:49.3042501Z stepcurrent: skipping 13 already run items. 
Running only test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True 2025-12-04T14:34:49.3042549Z Running 1 items in this shard 2025-12-04T14:34:49.3042552Z 2025-12-04T14:34:49.3042889Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True I1204 14:25:42.255000 129621 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 130277 2025-12-04T14:34:49.3043048Z I1204 14:25:42.256000 129621 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 130278 2025-12-04T14:34:49.3043210Z I1204 14:25:42.256000 129621 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 130279 2025-12-04T14:34:49.3043368Z I1204 14:25:42.257000 129621 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 130280 2025-12-04T14:34:49.3043526Z [rank2]:[W1204 14:28:24.480524706 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.3043699Z [rank2]:[W1204 14:28:24.480557105 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3043701Z 2025-12-04T14:34:49.3043854Z [rank1]:[W1204 14:28:24.480752351 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.3044008Z [rank3]:[W1204 14:28:24.480753001 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.3044176Z [rank1]:[W1204 14:28:24.480785660 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.3044178Z 2025-12-04T14:34:49.3044343Z [rank3]:[W1204 14:28:24.480807119 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3044345Z 2025-12-04T14:34:49.3044526Z [rank3]:[W1204 14:28:31.730070748 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3044528Z 2025-12-04T14:34:49.3044689Z [rank3]:[W1204 14:28:31.747988146 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3044692Z 2025-12-04T14:34:49.3044870Z [rank3]:[W1204 14:28:31.750624708 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3044881Z 2025-12-04T14:34:49.3045051Z [rank3]:[W1204 14:28:31.750789005 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3045053Z 2025-12-04T14:34:49.3045217Z [rank3]:[W1204 14:28:31.751035519 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3045219Z 2025-12-04T14:34:49.3045382Z [rank1]:[W1204 14:28:31.808791905 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3045384Z 2025-12-04T14:34:49.3045543Z [rank2]:[W1204 14:28:32.028583173 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3045545Z 2025-12-04T14:34:49.3045711Z [rank2]:[W1204 14:28:32.051190218 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.3045714Z 2025-12-04T14:34:49.3045878Z [rank2]:[W1204 14:28:32.054406978 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3045880Z 2025-12-04T14:34:49.3046043Z [rank2]:[W1204 14:28:32.054578044 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3046048Z 2025-12-04T14:34:49.3046211Z [rank2]:[W1204 14:28:32.054848378 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3046213Z 2025-12-04T14:34:49.3046373Z [rank1]:[W1204 14:28:32.160315919 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3046375Z 2025-12-04T14:34:49.3046540Z [rank1]:[W1204 14:28:32.163197876 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3046543Z 2025-12-04T14:34:49.3046703Z [rank1]:[W1204 14:28:32.163389632 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3046705Z 2025-12-04T14:34:49.3046868Z [rank1]:[W1204 14:28:32.163666246 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3046870Z 2025-12-04T14:34:49.3047033Z [rank2]:[W1204 14:29:37.202831257 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3047035Z 2025-12-04T14:34:49.3047194Z [rank2]:[W1204 14:29:37.207012986 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.3047197Z 2025-12-04T14:34:49.3047360Z [rank2]:[W1204 14:29:37.209209788 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3047363Z 2025-12-04T14:34:49.3047566Z [rank2]:[W1204 14:29:37.209331375 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3047568Z 2025-12-04T14:34:49.3047733Z [rank0]:[W1204 14:29:37.230141329 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3047735Z 2025-12-04T14:34:49.3047908Z [rank3]:[W1204 14:29:37.257165967 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3047914Z 2025-12-04T14:34:49.3048074Z [rank3]:[W1204 14:29:37.261349265 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3048093Z 2025-12-04T14:34:49.3048257Z [rank3]:[W1204 14:29:37.263521098 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3048271Z 2025-12-04T14:34:49.3048443Z [rank3]:[W1204 14:29:37.263630565 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3048445Z 2025-12-04T14:34:49.3048610Z [rank1]:[W1204 14:29:37.481832124 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3048612Z 2025-12-04T14:34:49.3048773Z [rank1]:[W1204 14:29:37.485775168 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.3048779Z 2025-12-04T14:34:49.3048938Z [rank1]:[W1204 14:29:37.487315434 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3048940Z 2025-12-04T14:34:49.3049108Z [rank1]:[W1204 14:29:37.487425982 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3049110Z 2025-12-04T14:34:49.3049459Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/constant_folding.py:256: UserWarning: Unsupported unwinding pattern: Address not in range (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/profiler/unwind/unwind.cpp:219.) 2025-12-04T14:34:49.3049514Z if out == self.unknown_value: 2025-12-04T14:34:49.3049677Z [rank0]:[W1204 14:29:44.469605014 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3049679Z 2025-12-04T14:34:49.3049842Z [rank0]:[W1204 14:29:44.471986182 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3049845Z 2025-12-04T14:34:49.3050010Z [rank0]:[W1204 14:29:44.472075670 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3050013Z 2025-12-04T14:34:49.3050381Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T14:34:49.3050426Z warnings.warn( 2025-12-04T14:34:49.3050608Z [rank0]:I1204 14:30:42.339000 130277 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 0 2025-12-04T14:34:49.3050792Z [rank3]:I1204 14:30:42.339000 130280 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 3 2025-12-04T14:34:49.3050970Z [rank2]:I1204 14:30:42.339000 130279 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 2 2025-12-04T14:34:49.3051154Z [rank1]:I1204 14:30:42.339000 130278 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 1 2025-12-04T14:34:49.3051304Z [rank0]:I1204 14:30:42.341000 130277 site-packages/torch/testing/_internal/common_distributed.py:891] Process 0 sent traceback 2025-12-04T14:34:49.3051468Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Process 0 timed out with traceback: 2025-12-04T14:34:49.3051583Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3051768Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e43ad9ff640 (most recent call first): 2025-12-04T14:34:49.3052054Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.3052369Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.3052592Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3052815Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3053031Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3053144Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3053317Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e21e8150640 (most recent call first): 2025-12-04T14:34:49.3053456Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3053564Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3053739Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e21e774f640 (most recent call first): 2025-12-04T14:34:49.3053872Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3053983Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3054154Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e21e8b51640 (most recent call first): 2025-12-04T14:34:49.3054291Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3054401Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3054576Z E1204 14:30:42.341000 
129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e21e9552640 (most recent call first): 2025-12-04T14:34:49.3054708Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3054820Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3054996Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e2364dff640 (most recent call first): 2025-12-04T14:34:49.3055198Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.3055408Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.3055634Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.3055869Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3056081Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3056191Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3056396Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007e3b994fe640 (most recent call first): 2025-12-04T14:34:49.3056697Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.3056903Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3057124Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3057334Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3057441Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3057655Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e458fd76740 (most recent call first): 2025-12-04T14:34:49.3057933Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/tmp/torchinductor_jenkins/tb/ctb4cmjd462qdjcrr2cwfl5e4ae7cd4vdfc52amimwcycebfeibz.py", line 238 in call 2025-12-04T14:34:49.3058177Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 3247 in run 2025-12-04T14:34:49.3058436Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 627 in __call__ 2025-12-04T14:34:49.3058719Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729 in inner_fn 2025-12-04T14:34:49.3059004Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 695 in inner_fn 2025-12-04T14:34:49.3059286Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531 in wrapper 2025-12-04T14:34:49.3059587Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134 in call_func_at_runtime_with_args 2025-12-04T14:34:49.3059883Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357 in runtime_wrapper 2025-12-04T14:34:49.3060140Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1135 in forward 2025-12-04T14:34:49.3060402Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3060646Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 4652 in wrapper 2025-12-04T14:34:49.3060918Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3061099Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File ".4", line 4 in forward 2025-12-04T14:34:49.3061355Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3061625Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3061871Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 442 in __call__ 2025-12-04T14:34:49.3062123Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 936 in call_wrapped 2025-12-04T14:34:49.3062382Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926 in compile_wrapper 2025-12-04T14:34:49.3062636Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3062902Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3063151Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441 in __call__ 2025-12-04T14:34:49.3063424Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/compiled_autograd.py", line 1132 in runtime_wrapper 2025-12-04T14:34:49.3063687Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.3063934Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.3064170Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.3064451Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1297 in test_tp_compile_fullgraph 2025-12-04T14:34:49.3064763Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.3065042Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.3065319Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.3065613Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.3065890Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.3066161Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.3066384Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.3066619Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.3066838Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.3067067Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.3067222Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.3067331Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3067509Z [rank2]:I1204 14:30:42.341000 130279 site-packages/torch/testing/_internal/common_distributed.py:891] Process 2 sent traceback 2025-12-04T14:34:49.3067669Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Process 1 timed out with traceback: 2025-12-04T14:34:49.3067775Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3067944Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007726feaff640 (most recent call first): 2025-12-04T14:34:49.3068223Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.3068503Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.3068710Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3068931Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3069154Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3069259Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3069428Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007727029ff640 (most recent call first): 2025-12-04T14:34:49.3069578Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3069711Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3069880Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007727055ff640 (most recent call first): 2025-12-04T14:34:49.3070008Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3070114Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3070281Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007726829ff640 (most recent call first): 2025-12-04T14:34:49.3070410Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3070515Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3070686Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007726855ff640 (most recent call first): 2025-12-04T14:34:49.3070816Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3070920Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3071092Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000077267fdff640 (most recent call first): 2025-12-04T14:34:49.3071293Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in 
wait 2025-12-04T14:34:49.3071495Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.3071721Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.3071944Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3072152Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3072261Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3076451Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x0000771e66fff640 (most recent call first): 2025-12-04T14:34:49.3076770Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.3076993Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3077248Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3077460Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", 
line 973 in _bootstrap 2025-12-04T14:34:49.3077638Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3077850Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000772860ffe740 (most recent call first): 2025-12-04T14:34:49.3078132Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/tmp/torchinductor_jenkins/h7/ch76wyw2mq5t2fssxx7yoa5ddi2e3se3tenlk5embmwkuyq5rnwq.py", line 241 in call 2025-12-04T14:34:49.3078373Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 3247 in run 2025-12-04T14:34:49.3078630Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 627 in __call__ 2025-12-04T14:34:49.3078914Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729 in inner_fn 2025-12-04T14:34:49.3079199Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 695 in inner_fn 2025-12-04T14:34:49.3079480Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531 in wrapper 2025-12-04T14:34:49.3079776Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134 in call_func_at_runtime_with_args 2025-12-04T14:34:49.3080072Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357 in runtime_wrapper 2025-12-04T14:34:49.3080328Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1135 in forward 2025-12-04T14:34:49.3080574Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3080817Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 4652 in wrapper 2025-12-04T14:34:49.3081060Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3081228Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File ".4", line 4 in forward 2025-12-04T14:34:49.3081482Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3081765Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3082009Z E1204 
14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 442 in __call__ 2025-12-04T14:34:49.3082285Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 936 in call_wrapped 2025-12-04T14:34:49.3082559Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926 in compile_wrapper 2025-12-04T14:34:49.3082816Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3083084Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3083331Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441 in __call__ 2025-12-04T14:34:49.3083605Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/compiled_autograd.py", line 1132 in runtime_wrapper 2025-12-04T14:34:49.3083868Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.3084115Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.3084347Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.3084630Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1297 in test_tp_compile_fullgraph 2025-12-04T14:34:49.3084930Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.3085214Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.3085481Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.3085758Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.3086035Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.3086317Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.3086541Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.3086792Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.3087020Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.3087249Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.3087410Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.3087559Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3087711Z [rank3]:I1204 14:30:42.341000 130280 site-packages/torch/testing/_internal/common_distributed.py:891] Process 3 sent traceback 2025-12-04T14:34:49.3087859Z [rank1]:I1204 14:30:42.341000 130278 site-packages/torch/testing/_internal/common_distributed.py:891] Process 1 sent traceback 2025-12-04T14:34:49.3088015Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Process 2 timed out with traceback: 2025-12-04T14:34:49.3088120Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3088295Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007197fe7ff640 (most 
recent call first): 2025-12-04T14:34:49.3088574Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.3088858Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.3089065Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3089283Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3089495Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3089600Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3089774Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007198095ff640 (most recent call first): 2025-12-04T14:34:49.3089907Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3090015Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3090183Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000071980c1ff640 (most recent call first): 2025-12-04T14:34:49.3090315Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 
2025-12-04T14:34:49.3090441Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3090612Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000071978bbff640 (most recent call first): 2025-12-04T14:34:49.3090755Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3090888Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3091058Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000719788fff640 (most recent call first): 2025-12-04T14:34:49.3091186Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3091292Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3091459Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007197863ff640 (most recent call first): 2025-12-04T14:34:49.3091663Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.3091867Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.3092095Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.3092313Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3092523Z 
E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3092628Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3092811Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x0000718f6d9fe640 (most recent call first): 2025-12-04T14:34:49.3093101Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.3093302Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3093522Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3093730Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3093838Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3094010Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000719967c26740 (most recent call first): 2025-12-04T14:34:49.3094262Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1108 in synchronize 2025-12-04T14:34:49.3094554Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 370 in benchmark_gpu 2025-12-04T14:34:49.3094818Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.3095106Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 200 in benchmark 2025-12-04T14:34:49.3095380Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.3095651Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 932 in bench 2025-12-04T14:34:49.3095930Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1072 in 2025-12-04T14:34:49.3096225Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1071 in benchmark_all_configs 2025-12-04T14:34:49.3096528Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1109 in autotune_to_one_config 2025-12-04T14:34:49.3096797Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1379 in run 2025-12-04T14:34:49.3097075Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/tmp/torchinductor_jenkins/3a/c3auzylvni5t4ai7vjpwyk5ellly4v3zrypuyqgytg7htsh3wq2t.py", line 264 in call 2025-12-04T14:34:49.3097316Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 3247 in run 2025-12-04T14:34:49.3097617Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 627 in __call__ 2025-12-04T14:34:49.3097899Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729 in inner_fn 2025-12-04T14:34:49.3098177Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 695 in inner_fn 2025-12-04T14:34:49.3098457Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531 in wrapper 2025-12-04T14:34:49.3098756Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134 in call_func_at_runtime_with_args 2025-12-04T14:34:49.3099059Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357 in runtime_wrapper 2025-12-04T14:34:49.3099316Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1135 in forward 2025-12-04T14:34:49.3099570Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3099839Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 4652 in wrapper 2025-12-04T14:34:49.3100083Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3100247Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File ".4", line 4 in forward 2025-12-04T14:34:49.3100503Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3100770Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3101015Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 442 in __call__ 
2025-12-04T14:34:49.3101267Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 936 in call_wrapped 2025-12-04T14:34:49.3101528Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926 in compile_wrapper 2025-12-04T14:34:49.3101782Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3102048Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3102297Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441 in __call__ 2025-12-04T14:34:49.3102568Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/compiled_autograd.py", line 1132 in runtime_wrapper 2025-12-04T14:34:49.3102830Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.3103078Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.3103309Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.3103600Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1297 in test_tp_compile_fullgraph 2025-12-04T14:34:49.3103900Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.3104218Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.3104483Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.3104760Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.3105036Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.3105307Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.3105529Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.3105759Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.3105980Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.3106212Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.3106370Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.3106476Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3106632Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Process 3 timed out with traceback: 2025-12-04T14:34:49.3106738Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3106907Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c661bdff640 (most recent call first): 2025-12-04T14:34:49.3107186Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 76 in _recv_msg 2025-12-04T14:34:49.3107469Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 271 in _read_thread 2025-12-04T14:34:49.3107707Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3107938Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3108148Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3108273Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3108468Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c6623fff640 (most recent call first): 2025-12-04T14:34:49.3108601Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3108705Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3108875Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c6628bff640 (most recent call first): 2025-12-04T14:34:49.3109004Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3109108Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3109277Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c65a39ff640 (most recent call first): 2025-12-04T14:34:49.3109408Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3109513Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3109682Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c65a0dff640 (most recent call first): 2025-12-04T14:34:49.3109813Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3109918Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3110086Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c659e1ff640 (most recent call first): 2025-12-04T14:34:49.3110291Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 324 in wait 2025-12-04T14:34:49.3110494Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 607 in wait 2025-12-04T14:34:49.3110717Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run 2025-12-04T14:34:49.3110937Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3111149Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3111253Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3111438Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007c5d898fe640 (most recent call first): 2025-12-04T14:34:49.3111723Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T14:34:49.3111935Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T14:34:49.3112153Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T14:34:49.3112376Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T14:34:49.3112501Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3112673Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c677fcb8740 (most recent call first): 2025-12-04T14:34:49.3112923Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1108 in synchronize 2025-12-04T14:34:49.3113198Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 370 in benchmark_gpu 2025-12-04T14:34:49.3113466Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.3113733Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 200 in benchmark 2025-12-04T14:34:49.3113997Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/benchmarking.py", line 92 in wrapper 2025-12-04T14:34:49.3114267Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 932 in bench 2025-12-04T14:34:49.3114544Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1072 in 2025-12-04T14:34:49.3114840Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1071 in benchmark_all_configs 2025-12-04T14:34:49.3115138Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1109 in autotune_to_one_config 2025-12-04T14:34:49.3115406Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1379 in run 2025-12-04T14:34:49.3115680Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/tmp/torchinductor_jenkins/oh/cohe3keh75xzmw7pxqx7fggqplmf3vfe3eb3jf6loh7m75hatm3v.py", line 264 in call 2025-12-04T14:34:49.3115919Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 3247 in run 2025-12-04T14:34:49.3116171Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 627 in __call__ 2025-12-04T14:34:49.3116459Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729 in inner_fn 2025-12-04T14:34:49.3116738Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 695 in inner_fn 2025-12-04T14:34:49.3117043Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531 in wrapper 2025-12-04T14:34:49.3117339Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134 in call_func_at_runtime_with_args 2025-12-04T14:34:49.3117666Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357 in runtime_wrapper 2025-12-04T14:34:49.3117921Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1135 in forward 2025-12-04T14:34:49.3118165Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3118403Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 4652 in wrapper 2025-12-04T14:34:49.3118648Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1154 in _fn 2025-12-04T14:34:49.3118812Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File ".4", line 4 in forward 2025-12-04T14:34:49.3119065Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3119332Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3119575Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 442 in __call__ 2025-12-04T14:34:49.3119826Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph_module.py", line 936 in call_wrapped 2025-12-04T14:34:49.3120083Z E1204 14:30:42.341000 129621 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 926 in compile_wrapper 2025-12-04T14:34:49.3120338Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789 in _call_impl 2025-12-04T14:34:49.3120605Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778 in _wrapped_call_impl 2025-12-04T14:34:49.3120867Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 441 in __call__ 2025-12-04T14:34:49.3121138Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/compiled_autograd.py", line 1132 in runtime_wrapper 2025-12-04T14:34:49.3121442Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 865 in _engine_run_backward 2025-12-04T14:34:49.3121688Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 364 in backward 2025-12-04T14:34:49.3121921Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 629 in backward 2025-12-04T14:34:49.3122200Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/var/lib/jenkins/pytorch/test/distributed/tensor/test_dtensor_compile.py", line 1297 in test_tp_compile_fullgraph 2025-12-04T14:34:49.3122501Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 536 in wrapper 2025-12-04T14:34:49.3122782Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 586 in instantiated_test 2025-12-04T14:34:49.3123049Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T14:34:49.3123321Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T14:34:49.3123597Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T14:34:49.3123869Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T14:34:49.3124091Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T14:34:49.3124322Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T14:34:49.3124545Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T14:34:49.3124776Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T14:34:49.3124932Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T14:34:49.3125038Z E1204 14:30:42.341000 129621 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T14:34:49.3125109Z Exception in thread Thread-1 (_event_listener): 2025-12-04T14:34:49.3125158Z Traceback (most recent call last): 2025-12-04T14:34:49.3125267Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T14:34:49.3125304Z self.run() 2025-12-04T14:34:49.3125404Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T14:34:49.3125457Z self._target(*self._args, **self._kwargs) 2025-12-04T14:34:49.3125655Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 879, in _event_listener 2025-12-04T14:34:49.3125701Z event = parent_pipe.recv() 2025-12-04T14:34:49.3125816Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv 2025-12-04T14:34:49.3125859Z buf = self._recv_bytes() 2025-12-04T14:34:49.3125981Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes 2025-12-04T14:34:49.3126024Z buf = self._recv(4) 2025-12-04T14:34:49.3126136Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 383, in _recv 2025-12-04T14:34:49.3126173Z 
raise EOFError 2025-12-04T14:34:49.3126209Z EOFError 2025-12-04T14:34:49.3126251Z FAILED [300.0900s] [100%] 2025-12-04T14:34:49.3126255Z 2025-12-04T14:34:49.3126312Z =================================== FAILURES =================================== 2025-12-04T14:34:49.3126422Z _ TestDTensorCompileE2E.test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True _ 2025-12-04T14:34:49.3126469Z Traceback (most recent call last): 2025-12-04T14:34:49.3126632Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T14:34:49.3126675Z self._join_processes(fn) 2025-12-04T14:34:49.3126846Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T14:34:49.3126902Z self._check_return_codes(fn, elapsed_time) 2025-12-04T14:34:49.3127077Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1084, in _check_return_codes 2025-12-04T14:34:49.3127118Z raise RuntimeError( 2025-12-04T14:34:49.3127213Z RuntimeError: Process 0 terminated or timed out after 300.0845081806183 seconds 2025-12-04T14:34:49.3127290Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T14:34:49.3127358Z Timing out after 300 seconds and killing subprocesses. 
2025-12-04T14:34:49.3127655Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-2e848ad2e669fc95.xml - 2025-12-04T14:34:49.3127717Z =========================== short test summary info ============================ 2025-12-04T14:34:49.3128010Z FAILED [300.0900s] distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True - RuntimeError: Process 0 terminated or timed out after 300.0845081806183 seconds 2025-12-04T14:34:49.3128074Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T14:34:49.3128142Z ================= 1 failed, 13 deselected in 300.10s (0:05:00) ================= 2025-12-04T14:34:49.3128386Z /opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown 2025-12-04T14:34:49.3128458Z warnings.warn('resource_tracker: There appear to be %d ' 2025-12-04T14:34:49.3128497Z Got exit code 1 2025-12-04T14:34:49.3128537Z Retrying single test... 
2025-12-04T14:34:49.3128755Z Test results will be stored in test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-d087fac6beb3c040.xml 2025-12-04T14:34:49.3128828Z ============================= test session starts ============================== 2025-12-04T14:34:49.3128946Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T14:34:49.3128986Z cachedir: .pytest_cache 2025-12-04T14:34:49.3129149Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T14:34:49.3129209Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T14:34:49.3129270Z configfile: pytest.ini 2025-12-04T14:34:49.3129449Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T14:34:49.3129526Z collecting ... collected 49 items / 13 deselected / 36 selected 2025-12-04T14:34:49.3129778Z stepcurrent: skipping 13 already run items. 
Running only test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True 2025-12-04T14:34:49.3129821Z Running 1 items in this shard 2025-12-04T14:34:49.3129825Z 2025-12-04T14:34:49.3130155Z distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True I1204 14:30:48.160000 133596 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 134252 2025-12-04T14:34:49.3130312Z I1204 14:30:48.160000 133596 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 134253 2025-12-04T14:34:49.3130466Z I1204 14:30:48.160000 133596 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 134254 2025-12-04T14:34:49.3130616Z I1204 14:30:48.161000 133596 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 134255 2025-12-04T14:34:49.3130771Z [rank1]:[W1204 14:31:20.777898161 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.3130936Z [rank1]:[W1204 14:31:20.777931580 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3130940Z 2025-12-04T14:34:49.3131088Z [rank2]:[W1204 14:31:20.777959880 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.3131237Z [rank3]:[W1204 14:31:20.777959900 unwind.cpp:219] Warning: Unsupported unwinding pattern: Address not in range (function unwinderFor) 2025-12-04T14:34:49.3131400Z [rank2]:[W1204 14:31:20.777997229 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.3131403Z 2025-12-04T14:34:49.3131561Z [rank3]:[W1204 14:31:20.778004759 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3131563Z 2025-12-04T14:34:49.3131720Z [rank3]:[W1204 14:31:28.410106302 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3131722Z 2025-12-04T14:34:49.3131880Z [rank1]:[W1204 14:31:28.413855030 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3131882Z 2025-12-04T14:34:49.3132039Z [rank1]:[W1204 14:31:28.433864581 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3132043Z 2025-12-04T14:34:49.3132202Z [rank1]:[W1204 14:31:28.436462024 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3132204Z 2025-12-04T14:34:49.3132363Z [rank1]:[W1204 14:31:28.436617211 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3132365Z 2025-12-04T14:34:49.3132533Z [rank1]:[W1204 14:31:28.436851516 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3132535Z 2025-12-04T14:34:49.3132693Z [rank2]:[W1204 14:31:28.455485067 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3132695Z 2025-12-04T14:34:49.3132861Z [rank2]:[W1204 14:31:28.477467735 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.3132876Z 2025-12-04T14:34:49.3133042Z [rank2]:[W1204 14:31:28.480341052 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3133044Z 2025-12-04T14:34:49.3133203Z [rank2]:[W1204 14:31:28.480498298 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3133205Z 2025-12-04T14:34:49.3133364Z [rank2]:[W1204 14:31:28.480744053 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3133366Z 2025-12-04T14:34:49.3133523Z [rank3]:[W1204 14:31:28.740498625 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3133525Z 2025-12-04T14:34:49.3133682Z [rank3]:[W1204 14:31:28.743963869 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3133686Z 2025-12-04T14:34:49.3133845Z [rank3]:[W1204 14:31:28.744161824 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3133846Z 2025-12-04T14:34:49.3134002Z [rank3]:[W1204 14:31:28.744461458 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3134004Z 2025-12-04T14:34:49.3134161Z [rank1]:[W1204 14:32:34.701653154 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3134163Z 2025-12-04T14:34:49.3134320Z [rank1]:[W1204 14:32:34.706490218 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.3134322Z 2025-12-04T14:34:49.3134478Z [rank1]:[W1204 14:32:34.708093523 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3134483Z 2025-12-04T14:34:49.3134639Z [rank1]:[W1204 14:32:34.708188411 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3134641Z 2025-12-04T14:34:49.3134797Z [rank0]:[W1204 14:32:34.718839227 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3134799Z 2025-12-04T14:34:49.3134955Z [rank2]:[W1204 14:32:34.786303776 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3134957Z 2025-12-04T14:34:49.3135115Z [rank2]:[W1204 14:32:34.791090851 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3135118Z 2025-12-04T14:34:49.3135276Z [rank2]:[W1204 14:32:34.792685326 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3135280Z 2025-12-04T14:34:49.3135437Z [rank2]:[W1204 14:32:34.792799213 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3135439Z 2025-12-04T14:34:49.3135595Z [rank3]:[W1204 14:32:35.042339006 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3135597Z 2025-12-04T14:34:49.3135763Z [rank3]:[W1204 14:32:35.046959065 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.3135765Z 2025-12-04T14:34:49.3135923Z [rank3]:[W1204 14:32:35.048692647 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3135940Z 2025-12-04T14:34:49.3136097Z [rank3]:[W1204 14:32:35.048809854 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3136109Z 2025-12-04T14:34:49.3136466Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/constant_folding.py:256: UserWarning: Unsupported unwinding pattern: Address not in range (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/profiler/unwind/unwind.cpp:219.) 2025-12-04T14:34:49.3136513Z if out == self.unknown_value: 2025-12-04T14:34:49.3136672Z [rank0]:[W1204 14:32:41.389514473 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3136674Z 2025-12-04T14:34:49.3136832Z [rank0]:[W1204 14:32:41.392187874 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 2025-12-04T14:34:49.3136834Z 2025-12-04T14:34:49.3136990Z [rank0]:[W1204 14:32:41.392279892 Module.cpp:201] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1... 
2025-12-04T14:34:49.3136994Z 2025-12-04T14:34:49.3137036Z PASSED [234.6297s] [100%] 2025-12-04T14:34:49.3137038Z 2025-12-04T14:34:49.3137298Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-d087fac6beb3c040.xml - 2025-12-04T14:34:49.3137370Z ================= 1 passed, 13 deselected in 234.64s (0:03:54) ================= 2025-12-04T14:34:49.3137407Z Got exit code 0 2025-12-04T14:34:49.3137526Z Test succeeded in new process, continuing with the rest of the tests 2025-12-04T14:34:49.3137746Z Test results will be stored in test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-6524afc1cf1c470e.xml 2025-12-04T14:34:49.3137804Z ============================= test session starts ============================== 2025-12-04T14:34:49.3137920Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T14:34:49.3137961Z cachedir: .pytest_cache 2025-12-04T14:34:49.3138122Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T14:34:49.3138168Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T14:34:49.3138209Z configfile: pytest.ini 2025-12-04T14:34:49.3138370Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T14:34:49.3138445Z collecting ... collected 49 items / 14 deselected / 35 selected 2025-12-04T14:34:49.3138500Z stepcurrent: skipping 14 already run items. 
2025-12-04T14:34:49.3138543Z Running 0 items in this shard 2025-12-04T14:34:49.3138545Z 2025-12-04T14:34:49.3138799Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.tensor.test_dtensor_compile/distributed.tensor.test_dtensor_compile-6524afc1cf1c470e.xml - 2025-12-04T14:34:49.3138859Z ============================ 14 deselected in 0.01s ============================ 2025-12-04T14:34:49.3139125Z The following tests failed and then succeeded when run in a new process['test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_False_use_ca_True'] 2025-12-04T14:34:49.3139333Z The following tests failed consistently: ['test/distributed/tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_2d_fsdp_tp_ac_compile_use_ca_True'] 2025-12-04T14:34:49.3139335Z 2025-12-04T14:34:49.3139559Z FINISHED PRINTING LOG FILE of distributed/tensor/test_dtensor_compile 3/4 (test/test-reports/distributed.tensor.test_dtensor_compile_3.4_a5d422218d59addd_.log) 2025-12-04T14:34:49.3139561Z 2025-12-04T14:34:49.3139701Z Finished distributed/tensor/test_dtensor_compile 3/4 ... [2025-12-04 14:34:49.258083][2264644.194331357], took 37.80min 2025-12-04T14:34:49.3139968Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T14:34:49.3140111Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T14:34:49.3140208Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T14:34:49.3140258Z Uploading artifacts took 0.00 seconds 2025-12-04T14:34:49.3140321Z distributed/tensor/test_dtensor_compile 3/4 failed! 2025-12-04T14:34:49.3140461Z Running distributed/checkpoint/_experimental/test_barriers 1/1 ... 
[2025-12-04 14:34:49.259723][2264644.195974141] 2025-12-04T14:34:49.3140512Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T14:34:49.3140857Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/_experimental/test_barriers.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 14:34:49.259913] 2025-12-04T14:34:51.4779870Z 2025-12-04T14:34:51.4780414Z distributed/checkpoint/_experimental/test_barriers 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint._experimental.test_barriers_1.1_65cb54595ed999d1_.log 2025-12-04T14:34:51.4781371Z Running 2 items in this shard: test/distributed/checkpoint/_experimental/test_barriers.py::TestBarriers::test_execute_barrier, test/distributed/checkpoint/_experimental/test_barriers.py::TestBarriers::test_tcpstore_barrier_initialization 2025-12-04T14:34:51.4781904Z 2025-12-04T14:34:51.4782123Z Finished distributed/checkpoint/_experimental/test_barriers 1/1 ... [2025-12-04 14:34:51.477656][2264646.413902798], took 0.04min 2025-12-04T14:34:51.4785932Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T14:34:51.4797150Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T14:34:51.4800092Z Running distributed/pipelining/test_transformer 1/1 ... [2025-12-04 14:34:51.479855][2264646.41610592] 2025-12-04T14:34:51.4800332Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T14:34:51.4801879Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/pipelining/test_transformer.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 14:34:51.480077] 2025-12-04T14:35:07.3220135Z 2025-12-04T14:35:07.3221441Z distributed/pipelining/test_transformer 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.pipelining.test_transformer_1.1_4d1ccce35dfb4ae3_.log 2025-12-04T14:35:07.3222594Z Running 1 items in this shard: test/distributed/pipelining/test_transformer.py::TransformerTestsCUDA::test_ir_cuda 2025-12-04T14:35:07.3223027Z 2025-12-04T14:35:07.3223366Z Finished distributed/pipelining/test_transformer 1/1 ... [2025-12-04 14:35:07.321732][2264662.257977012], took 0.26min 2025-12-04T14:35:07.3230028Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T14:35:07.3237375Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T14:35:07.3240117Z Running distributed/flight_recorder/test_fr_analysis 1/1 ... [2025-12-04 14:35:07.323889][2264662.260140365] 2025-12-04T14:35:07.3240515Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T14:35:07.3242508Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/flight_recorder/test_fr_analysis.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 14:35:07.324090] 2025-12-04T14:35:09.4418226Z 2025-12-04T14:35:09.4419379Z distributed/flight_recorder/test_fr_analysis 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.flight_recorder.test_fr_analysis_1.1_f6e7383a5b0ad561_.log 2025-12-04T14:35:09.4421535Z Running 4 items in this shard: test/distributed/flight_recorder/test_fr_analysis.py::FlightRecorderEventTest::test_all_events, test/distributed/flight_recorder/test_fr_analysis.py::FlightRecorderEventTest::test_match_one_event, test/distributed/flight_recorder/test_fr_analysis.py::FlightMatchInfoTest::test_match_info, test/distributed/flight_recorder/test_fr_analysis.py::FlightRecorderE2ETest::testBuildDB 2025-12-04T14:35:09.4422681Z 2025-12-04T14:35:09.4422998Z Finished distributed/flight_recorder/test_fr_analysis 1/1 ... [2025-12-04 14:35:09.441590][2264664.377837149], took 0.04min 2025-12-04T14:35:09.4423976Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T14:35:09.4435852Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T14:35:09.4438546Z Running distributed/_composable/test_contract 1/1 ... [2025-12-04 14:35:09.443765][2264664.380015491] 2025-12-04T14:35:09.4438889Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T14:35:09.4440683Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/_composable/test_contract.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 14:35:09.443968] 2025-12-04T14:35:11.5618298Z 2025-12-04T14:35:11.5619538Z distributed/_composable/test_contract 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed._composable.test_contract_1.1_0704deec42127662_.log 2025-12-04T14:35:11.5621462Z Running 5 items in this shard: test/distributed/_composable/test_contract.py::TestContract::test_add_hooks, test/distributed/_composable/test_contract.py::TestContract::test_modify_fqn, test/distributed/_composable/test_contract.py::TestContract::test_multi_module_api, test/distributed/_composable/test_contract.py::TestContract::test_registry, test/distributed/_composable/test_contract.py::TestContract::test_state 2025-12-04T14:35:11.5622958Z 2025-12-04T14:35:11.5623273Z Finished distributed/_composable/test_contract 1/1 ... [2025-12-04 14:35:11.561569][2264666.497815363], took 0.04min 2025-12-04T14:35:11.5624636Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T14:35:11.5636437Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T14:35:11.5639031Z Running distributed/checkpoint/test_dedup_tensors 1/1 ... [2025-12-04 14:35:11.563801][2264666.500052174] 2025-12-04T14:35:11.5639389Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T14:35:11.5641175Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/test_dedup_tensors.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 14:35:11.563996] 2025-12-04T14:35:13.7320462Z 2025-12-04T14:35:13.7321496Z distributed/checkpoint/test_dedup_tensors 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint.test_dedup_tensors_1.1_aa194e1a9499a8af_.log 2025-12-04T14:35:13.7322480Z Running 1 items in this shard: test/distributed/checkpoint/test_dedup_tensors.py::TestDedupTensor::test_dedup_shards 2025-12-04T14:35:13.7323136Z 2025-12-04T14:35:13.7323422Z Finished distributed/checkpoint/test_dedup_tensors 1/1 ... [2025-12-04 14:35:13.731891][2264668.66813827], took 0.04min 2025-12-04T14:35:13.7328170Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T14:35:13.7337930Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T14:35:13.7340266Z Running distributed/test_c10d_functional_native 1/1 ... [2025-12-04 14:35:13.733938][2264668.670189495] 2025-12-04T14:35:13.7340598Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T14:35:13.7342221Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/test_c10d_functional_native.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 14:35:13.734113] 2025-12-04T15:08:26.9396162Z 2025-12-04T15:08:26.9397190Z PRINTING LOG FILE of distributed/test_c10d_functional_native 1/1 (test/test-reports/distributed.test_c10d_functional_native_1.1_46b22b661a00e22b_.log) 2025-12-04T15:08:26.9398682Z Test results will be stored in test-reports/python-pytest/distributed.test_c10d_functional_native/distributed.test_c10d_functional_native-e2bcf2ed36e7b80f.xml 2025-12-04T15:08:26.9399559Z ============================= test session starts ============================== 2025-12-04T15:08:26.9400170Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T15:08:26.9400725Z cachedir: .pytest_cache 2025-12-04T15:08:26.9401352Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T15:08:26.9402018Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T15:08:26.9402354Z configfile: pytest.ini 2025-12-04T15:08:26.9402979Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T15:08:26.9403645Z collecting ... 
collected 33 items 2025-12-04T15:08:26.9404020Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T15:08:26.9413649Z Running 33 items in this shard: test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_gather_into_tensor_coalesced, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_gather_into_tensor_single, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_reduce_coalesced, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_reduce_coalesced_, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_reduce_single, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_reduce_single_, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_to_all_single, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_broadcast, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_fixed_striding, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_functional_collectives_inference_mode, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_inductor_dtypeview_memory_leak, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_reduce_scatter_tensor_coalesced, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_reduce_scatter_tensor_out, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_reduce_scatter_tensor_single, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_threading, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited, test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor, test/distributed/test_c10d_functional_native.py::PyWorkTest::test_collectives, test/distributed/test_c10d_functional_native.py::PyWorkTest::test_wait_tensor, test/distributed/test_c10d_functional_native.py::CompileTestCPU::test_inductor_all_reduce_cpu, 
test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_gather_into_tensor_coalesced, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_gather_into_tensor_single, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_coalesced, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_non_contig_input, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_single, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_to_all_single, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_broadcast, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_inplace_op_on_view, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_reduce_scatter_tensor_coalesced, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_reduce_scatter_tensor_single, test/distributed/test_c10d_functional_native.py::CompileTest::test_inductor_reuse_buffer_after_inplace_collective, test/distributed/test_c10d_functional_native.py::CompileTest::test_ranks_and_tag, test/distributed/test_c10d_functional_native.py::CompileTest::test_wait_tensor 2025-12-04T15:08:26.9419357Z 2025-12-04T15:08:26.9419675Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_gather_into_tensor_coalesced I1204 14:35:18.384000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 139160 2025-12-04T15:08:26.9420201Z I1204 14:35:18.384000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 139161 2025-12-04T15:08:26.9420454Z PASSED [8.9178s] [ 3%] 2025-12-04T15:08:26.9420827Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_gather_into_tensor_single I1204 14:35:27.305000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started 
process 0 with pid 140495 2025-12-04T15:08:26.9421332Z I1204 14:35:27.306000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 140496 2025-12-04T15:08:26.9421587Z PASSED [148.3032s] [ 6%] 2025-12-04T15:08:26.9421947Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_reduce_coalesced I1204 14:37:55.610000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 141830 2025-12-04T15:08:26.9422444Z I1204 14:37:55.611000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 141831 2025-12-04T15:08:26.9422692Z PASSED [8.7158s] [ 9%] 2025-12-04T15:08:26.9423049Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_reduce_coalesced_ I1204 14:38:04.327000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 143165 2025-12-04T15:08:26.9423538Z I1204 14:38:04.328000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 143166 2025-12-04T15:08:26.9423785Z PASSED [145.3952s] [ 12%] 2025-12-04T15:08:26.9424137Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_reduce_single I1204 14:40:29.724000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 144500 2025-12-04T15:08:26.9424623Z I1204 14:40:29.725000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 144501 2025-12-04T15:08:26.9424870Z PASSED [148.6942s] [ 15%] 2025-12-04T15:08:26.9425224Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_reduce_single_ I1204 14:42:58.420000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 145835 2025-12-04T15:08:26.9425711Z I1204 14:42:58.421000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 145836 
2025-12-04T15:08:26.9425994Z PASSED [148.8895s] [ 18%] 2025-12-04T15:08:26.9426340Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_all_to_all_single I1204 14:45:27.312000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 147170 2025-12-04T15:08:26.9426844Z I1204 14:45:27.312000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 147171 2025-12-04T15:08:26.9427105Z PASSED [8.9149s] [ 21%] 2025-12-04T15:08:26.9427449Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_broadcast I1204 14:45:36.228000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 148507 2025-12-04T15:08:26.9427930Z I1204 14:45:36.229000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 148508 2025-12-04T15:08:26.9428157Z PASSED [8.5146s] [ 24%] 2025-12-04T15:08:26.9428473Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_fixed_striding I1204 14:45:44.745000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 149842 2025-12-04T15:08:26.9428912Z I1204 14:45:44.745000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 149843 2025-12-04T15:08:26.9429138Z PASSED [183.2401s] [ 27%] 2025-12-04T15:08:26.9429490Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_functional_collectives_inference_mode I1204 14:48:47.986000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 151987 2025-12-04T15:08:26.9429992Z I1204 14:48:47.987000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 151988 2025-12-04T15:08:26.9430217Z PASSED [117.4662s] [ 30%] 2025-12-04T15:08:26.9430561Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_inductor_dtypeview_memory_leak I1204 
14:50:45.454000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 153322 2025-12-04T15:08:26.9431026Z I1204 14:50:45.454000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 153323 2025-12-04T15:08:26.9431250Z PASSED [164.4306s] [ 33%] 2025-12-04T15:08:26.9431589Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_reduce_scatter_tensor_coalesced I1204 14:53:29.886000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 154931 2025-12-04T15:08:26.9432053Z I1204 14:53:29.886000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 154932 2025-12-04T15:08:26.9432278Z PASSED [146.2966s] [ 36%] 2025-12-04T15:08:26.9432606Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_reduce_scatter_tensor_out I1204 14:55:56.184000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 156266 2025-12-04T15:08:26.9433060Z I1204 14:55:56.185000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 156267 2025-12-04T15:08:26.9433283Z PASSED [8.7139s] [ 39%] 2025-12-04T15:08:26.9433616Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_reduce_scatter_tensor_single I1204 14:56:04.899000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 157601 2025-12-04T15:08:26.9434078Z I1204 14:56:04.900000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 157602 2025-12-04T15:08:26.9434303Z PASSED [164.5066s] [ 42%] 2025-12-04T15:08:26.9434615Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_threading I1204 14:58:49.408000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 158936 2025-12-04T15:08:26.9435050Z I1204 14:58:49.408000 
138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 158937 2025-12-04T15:08:26.9435290Z PASSED [150.3969s] [ 45%] 2025-12-04T15:08:26.9435599Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited I1204 15:01:19.806000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 160891 2025-12-04T15:08:26.9436028Z I1204 15:01:19.807000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 160892 2025-12-04T15:08:26.9436461Z [W1204 15:03:41.753060612 ProcessGroup.cpp:367] Warning: At the time of process termination, there are still 1 unwaited collective calls. Please review your program to ensure that: 2025-12-04T15:08:26.9436839Z 1. c10d_functional.wait_tensor() is invoked on all tensors returned from c10d_functional collective, 2025-12-04T15:08:26.9437206Z 2. c10d_functional.wait_tensor() is invoked on all output tensors of async_op=True torch.distributed collective called under `with allow_inflight_collective_as_graph_input_ctx():`, 2025-12-04T15:08:26.9437584Z before the output tensors of the collective are used. (function ~WorkRegistry) 2025-12-04T15:08:26.9437923Z [W1204 15:03:42.176925204 ProcessGroup.cpp:367] Warning: At the time of process termination, there are still 1 unwaited collective calls. Please review your program to ensure that: 2025-12-04T15:08:26.9438279Z 1. c10d_functional.wait_tensor() is invoked on all tensors returned from c10d_functional collective, 2025-12-04T15:08:26.9438643Z 2. c10d_functional.wait_tensor() is invoked on all output tensors of async_op=True torch.distributed collective called under `with allow_inflight_collective_as_graph_input_ctx():`, 2025-12-04T15:08:26.9438983Z before the output tensors of the collective are used. 
(function ~WorkRegistry) 2025-12-04T15:08:26.9439149Z PASSED [142.7867s] [ 48%] 2025-12-04T15:08:26.9439466Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor I1204 15:03:42.595000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 162226 2025-12-04T15:08:26.9439908Z I1204 15:03:42.595000 138504 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 162227 2025-12-04T15:08:26.9440155Z Exception in thread Thread-1 (_event_listener): 2025-12-04T15:08:26.9440291Z Traceback (most recent call last): 2025-12-04T15:08:26.9440481Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T15:08:26.9440661Z self.run() 2025-12-04T15:08:26.9440803Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T15:08:26.9440982Z Exception in thread Thread-1 (_event_listener): 2025-12-04T15:08:26.9441117Z Traceback (most recent call last): 2025-12-04T15:08:26.9441297Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T15:08:26.9441487Z self._target(*self._args, **self._kwargs) 2025-12-04T15:08:26.9441745Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 879, in _event_listener 2025-12-04T15:08:26.9441999Z event = parent_pipe.recv() 2025-12-04T15:08:26.9442183Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv 2025-12-04T15:08:26.9442373Z buf = self._recv_bytes() 2025-12-04T15:08:26.9442563Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes 2025-12-04T15:08:26.9442760Z buf = self._recv(4) 2025-12-04T15:08:26.9442935Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 383, in _recv 2025-12-04T15:08:26.9443118Z raise EOFError 2025-12-04T15:08:26.9443210Z EOFError 
2025-12-04T15:08:26.9443295Z self.run() 2025-12-04T15:08:26.9443434Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T15:08:26.9443606Z self._target(*self._args, **self._kwargs) 2025-12-04T15:08:26.9443881Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 879, in _event_listener 2025-12-04T15:08:26.9444136Z event = parent_pipe.recv() 2025-12-04T15:08:26.9444317Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv 2025-12-04T15:08:26.9444503Z buf = self._recv_bytes() 2025-12-04T15:08:26.9444708Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes 2025-12-04T15:08:26.9444901Z buf = self._recv(4) 2025-12-04T15:08:26.9445125Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 383, in _recv 2025-12-04T15:08:26.9445309Z raise EOFError 2025-12-04T15:08:26.9445400Z EOFError 2025-12-04T15:08:26.9445448Z 2025-12-04T15:08:26.9445450Z 2025-12-04T15:08:26.9445715Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_c10d_functional_native/distributed.test_c10d_functional_native-e2bcf2ed36e7b80f.xml - 2025-12-04T15:08:26.9446066Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T15:08:26.9446325Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py:1036: KeyboardInterrupt 2025-12-04T15:08:26.9446596Z (to show a full traceback on KeyboardInterrupt use --full-trace) 2025-12-04T15:08:26.9446768Z ======================= 16 passed in 1795.50s (0:29:55) ======================== 2025-12-04T15:08:26.9446912Z Command took >30min, returning 124 2025-12-04T15:08:26.9447027Z Got exit code 124 2025-12-04T15:08:26.9447124Z Retrying single test... 
2025-12-04T15:08:26.9447395Z Test results will be stored in test-reports/python-pytest/distributed.test_c10d_functional_native/distributed.test_c10d_functional_native-b2d09b5b0a727bbd.xml 2025-12-04T15:08:26.9447747Z ============================= test session starts ============================== 2025-12-04T15:08:26.9447957Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T15:08:26.9448143Z cachedir: .pytest_cache 2025-12-04T15:08:26.9448364Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T15:08:26.9448600Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T15:08:26.9448718Z configfile: pytest.ini 2025-12-04T15:08:26.9448944Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T15:08:26.9449214Z collecting ... collected 33 items / 32 deselected / 1 selected 2025-12-04T15:08:26.9449498Z stepcurrent: skipping 16 already run items. 
Running only test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor 2025-12-04T15:08:26.9449748Z Running 1 items in this shard 2025-12-04T15:08:26.9449819Z 2025-12-04T15:08:26.9450075Z distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor I1204 15:05:23.783000 163547 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 164217 2025-12-04T15:08:26.9450517Z I1204 15:05:23.784000 163547 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 164218 2025-12-04T15:08:26.9450747Z PASSED [144.7905s] [100%] 2025-12-04T15:08:26.9450814Z 2025-12-04T15:08:26.9451065Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_c10d_functional_native/distributed.test_c10d_functional_native-b2d09b5b0a727bbd.xml - 2025-12-04T15:08:26.9451422Z ================= 1 passed, 32 deselected in 144.80s (0:02:24) ================= 2025-12-04T15:08:26.9451566Z Got exit code 0 2025-12-04T15:08:26.9451704Z Test succeeded in new process, continuing with the rest of the tests 2025-12-04T15:08:26.9452025Z Test results will be stored in test-reports/python-pytest/distributed.test_c10d_functional_native/distributed.test_c10d_functional_native-35d814276547fc24.xml 2025-12-04T15:08:26.9452325Z ============================= test session starts ============================== 2025-12-04T15:08:26.9452555Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T15:08:26.9452740Z cachedir: .pytest_cache 2025-12-04T15:08:26.9452960Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T15:08:26.9453208Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T15:08:26.9453324Z configfile: pytest.ini 2025-12-04T15:08:26.9453576Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, 
xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T15:08:26.9453843Z collecting ... collected 33 items / 17 deselected / 16 selected 2025-12-04T15:08:26.9454006Z stepcurrent: skipping 17 already run items. 2025-12-04T15:08:26.9454135Z Running 16 items in this shard 2025-12-04T15:08:26.9454207Z 2025-12-04T15:08:26.9454408Z distributed/test_c10d_functional_native.py::PyWorkTest::test_collectives [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 2025-12-04T15:08:26.9454678Z PASSED [0.4657s] [ 6%] 2025-12-04T15:08:26.9454857Z distributed/test_c10d_functional_native.py::PyWorkTest::test_wait_tensor PASSED [0.1065s] [ 12%] 2025-12-04T15:08:26.9455152Z distributed/test_c10d_functional_native.py::CompileTestCPU::test_inductor_all_reduce_cpu PASSED [13.0856s] [ 18%] 2025-12-04T15:08:26.9455893Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_gather_into_tensor_coalesced SKIPPED [0.0004s] (Test is disabled because an issue exists disabling it: https://github.com/pytorch/pytorch/issues/146806 for platform(s) inductor, linux, rocm. If you're seeing this on your local machine and would like to enable this test, please make sure CI is not set and you are not using the flag --import-disabled-tests.) [ 25%] 2025-12-04T15:08:26.9457005Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_gather_into_tensor_single SKIPPED [0.0003s] (Test is disabled because an issue exists disabling it: https://github.com/pytorch/pytorch/issues/147707 for platform(s) inductor, linux, rocm. If you're seeing this on your local machine and would like to enable this test, please make sure CI is not set and you are not using the flag --import-disabled-tests.) 
[ 31%] 2025-12-04T15:08:26.9457757Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_coalesced PASSED [4.8624s] [ 37%] 2025-12-04T15:08:26.9458088Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_non_contig_input PASSED [0.3032s] [ 43%] 2025-12-04T15:08:26.9458408Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_reduce_single PASSED [5.2846s] [ 50%] 2025-12-04T15:08:26.9459092Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_all_to_all_single SKIPPED [0.0005s] (Test is disabled because an issue exists disabling it: https://github.com/pytorch/pytorch/issues/147795 for platform(s) inductor, linux, rocm. If you're seeing this on your local machine and would like to enable this test, please make sure CI is not set and you are not using the flag --import-disabled-tests.) [ 56%] 2025-12-04T15:08:26.9460140Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_broadcast SKIPPED [0.0003s] (Test is disabled because an issue exists disabling it: https://github.com/pytorch/pytorch/issues/147816 for platform(s) inductor, linux, rocm. If you're seeing this on your local machine and would like to enable this test, please make sure CI is not set and you are not using the flag --import-disabled-tests.) [ 62%] 2025-12-04T15:08:26.9461195Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_inplace_op_on_view SKIPPED [0.0002s] (Test is disabled because an issue exists disabling it: https://github.com/pytorch/pytorch/issues/147852 for platform(s) inductor, linux, rocm. If you're seeing this on your local machine and would like to enable this test, please make sure CI is not set and you are not using the flag --import-disabled-tests.) 
[ 68%] 2025-12-04T15:08:26.9461923Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_reduce_scatter_tensor_coalesced PASSED [2.6951s] [ 75%] 2025-12-04T15:08:26.9462265Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_reduce_scatter_tensor_single PASSED [2.2584s] [ 81%] 2025-12-04T15:08:26.9462638Z distributed/test_c10d_functional_native.py::CompileTest::test_inductor_reuse_buffer_after_inplace_collective PASSED [1.1051s] [ 87%] 2025-12-04T15:08:26.9462986Z distributed/test_c10d_functional_native.py::CompileTest::test_ranks_and_tag PASSED [0.2851s] [ 93%] 2025-12-04T15:08:26.9463636Z distributed/test_c10d_functional_native.py::CompileTest::test_wait_tensor SKIPPED [0.0005s] (Test is disabled because an issue exists disabling it: https://github.com/pytorch/pytorch/issues/148014 for platform(s) inductor, linux, rocm. If you're seeing this on your local machine and would like to enable this test, please make sure CI is not set and you are not using the flag --import-disabled-tests.) [100%] 2025-12-04T15:08:26.9464162Z 2025-12-04T15:08:26.9464411Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_c10d_functional_native/distributed.test_c10d_functional_native-35d814276547fc24.xml - 2025-12-04T15:08:26.9464765Z ================ 10 passed, 6 skipped, 17 deselected in 30.48s ================= 2025-12-04T15:08:26.9465087Z [W1204 15:08:26.665870748 ProcessGroup.cpp:367] Warning: At the time of process termination, there are still 2 unwaited collective calls. Please review your program to ensure that: 2025-12-04T15:08:26.9465448Z 1. c10d_functional.wait_tensor() is invoked on all tensors returned from c10d_functional collective, 2025-12-04T15:08:26.9465811Z 2. 
c10d_functional.wait_tensor() is invoked on all output tensors of async_op=True torch.distributed collective called under `with allow_inflight_collective_as_graph_input_ctx():`, 2025-12-04T15:08:26.9466152Z before the output tensors of the collective are used. (function ~WorkRegistry) 2025-12-04T15:08:26.9466470Z The following tests failed and then succeeded when run in a new process['test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor'] 2025-12-04T15:08:26.9466694Z 2025-12-04T15:08:26.9466895Z FINISHED PRINTING LOG FILE of distributed/test_c10d_functional_native 1/1 (test/test-reports/distributed.test_c10d_functional_native_1.1_46b22b661a00e22b_.log) 2025-12-04T15:08:26.9467131Z 2025-12-04T15:08:26.9467264Z Finished distributed/test_c10d_functional_native 1/1 ... [2025-12-04 15:08:26.939286][2266661.875531531], took 33.22min 2025-12-04T15:08:26.9467770Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:08:26.9468160Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:08:26.9468374Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T15:08:26.9468551Z Uploading artifacts took 0.00 seconds 2025-12-04T15:08:26.9468748Z Running distributed/pipelining/test_backward 1/1 ... [2025-12-04 15:08:26.941248][2266661.877499198] 2025-12-04T15:08:26.9468948Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:08:26.9469356Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/pipelining/test_backward.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 15:08:26.941439] 2025-12-04T15:10:09.7593641Z 2025-12-04T15:10:09.7594155Z distributed/pipelining/test_backward 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.pipelining.test_backward_1.1_622a611b97d133ea_.log 2025-12-04T15:10:09.7596280Z Running 5 items in this shard: test/distributed/pipelining/test_backward.py::StageBackwardTestsCUDA::test_stage_backward_cuda, test/distributed/pipelining/test_backward.py::StageBackwardTestsCUDA::test_stage_backward_input_cuda, test/distributed/pipelining/test_backward.py::StageBackwardTestsCUDA::test_stage_backward_weight_cuda, test/distributed/pipelining/test_backward.py::StageBackwardTestsCUDA::test_stage_backward_weight_grad_validation_cuda, test/distributed/pipelining/test_backward.py::StageBackwardTestsCUDA::test_stage_backward_weight_multiple_iters_cuda 2025-12-04T15:10:09.7597691Z 2025-12-04T15:10:09.7598082Z Finished distributed/pipelining/test_backward 1/1 ... [2025-12-04 15:10:09.758447][2266764.694693444], took 1.71min 2025-12-04T15:10:09.7598773Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:10:09.7604619Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:10:09.7606267Z Running distributed/test_nvshmem_triton 1/1 ... [2025-12-04 15:10:09.760522][2266764.696773038] 2025-12-04T15:10:09.7606529Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:10:09.7608394Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/test_nvshmem_triton.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 15:10:09.760715] 2025-12-04T15:10:15.3851823Z 2025-12-04T15:10:15.3853314Z distributed/test_nvshmem_triton 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.test_nvshmem_triton_1.1_50369725900d89dd_.log 2025-12-04T15:10:15.3854027Z 2025-12-04T15:10:15.3854369Z Finished distributed/test_nvshmem_triton 1/1 ... [2025-12-04 15:10:15.384772][2266770.320997226], took 0.09min 2025-12-04T15:10:15.3859385Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:10:15.3872083Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:10:15.3872572Z Running distributed/tensor/test_dtensor 1/3 ... [2025-12-04 15:10:15.387105][2266770.323356445] 2025-12-04T15:10:15.3872952Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:10:15.3874981Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/test_dtensor.py', '--shard-id=1', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 15:10:15.387312] 2025-12-04T15:25:15.1159414Z 2025-12-04T15:25:15.1160595Z distributed/tensor/test_dtensor 1/3 was successful, full logs can be found in artifacts with path test/test-reports/distributed.tensor.test_dtensor_1.3_cea6dd724b93eccd_.log 2025-12-04T15:25:15.1168914Z Running 28 items in this shard: test/distributed/tensor/test_dtensor.py::DTensorTest::test_dtensor_new_empty_strided, test/distributed/tensor/test_dtensor.py::DTensorTest::test_dtensor_save_load, test/distributed/tensor/test_dtensor.py::DTensorTest::test_from_local_uneven_sharding, test/distributed/tensor/test_dtensor.py::DTensorTest::test_full_tensor_grad_hint, test/distributed/tensor/test_dtensor.py::DTensorTest::test_shard_tensor_2d, test/distributed/tensor/test_dtensor.py::DTensorTestWithLocalTensor::test_dtensor_properties, test/distributed/tensor/test_dtensor.py::DTensorTestWithLocalTensor::test_dtensor_stride, test/distributed/tensor/test_dtensor.py::DTensorTestWithLocalTensor::test_from_local_uneven_sharding, test/distributed/tensor/test_dtensor.py::DTensorTestWithLocalTensor::test_from_local_uneven_sharding_raise_error, test/distributed/tensor/test_dtensor.py::DTensorTestWithLocalTensor::test_meta_dtensor, test/distributed/tensor/test_dtensor.py::DTensorTestWithLocalTensor::test_modules_w_meta_dtensor, test/distributed/tensor/test_dtensor.py::DTensorMeshTest::test_device_mesh_nd, test/distributed/tensor/test_dtensor.py::DTensorMeshTest::test_dtensor_api_device_mesh_context_manager, test/distributed/tensor/test_dtensor.py::DTensorMeshTest::test_dtensor_cond, test/distributed/tensor/test_dtensor.py::DTensorMeshTest::test_from_local_sub_mesh, test/distributed/tensor/test_dtensor.py::DTensorMeshTest::test_implicit_replication, test/distributed/tensor/test_dtensor.py::DTensorMeshTest::test_inplace_on_local_tensor_view, test/distributed/tensor/test_dtensor.py::DTensorMeshTest::test_redistribute_sub_mesh, 
test/distributed/tensor/test_dtensor.py::DTensorMeshTestWithLocalTensor::test_default_value_sub_mesh, test/distributed/tensor/test_dtensor.py::DTensorMeshTestWithLocalTensor::test_dtensor_2d_mesh, test/distributed/tensor/test_dtensor.py::DTensorMeshTestWithLocalTensor::test_dtensor_api_device_mesh_context_manager, test/distributed/tensor/test_dtensor.py::DTensorMeshTestWithLocalTensor::test_dtensor_cond, test/distributed/tensor/test_dtensor.py::DTensorMeshTestWithLocalTensor::test_dtensor_device_mesh_device_conversion, test/distributed/tensor/test_dtensor.py::DTensorMeshTestWithLocalTensor::test_from_local_sub_mesh, test/distributed/tensor/test_dtensor.py::TestDTensorSpec::test_dtensor_spec_default_shard_order_generation, test/distributed/tensor/test_dtensor.py::TestDTensorSpec::test_dtensor_spec_update, test/distributed/tensor/test_dtensor.py::TestDTensorSpecWithLocalTensor::test_dtensor_spec_print, test/distributed/tensor/test_dtensor.py::TestDTensorSpecWithLocalTensor::test_dtensor_spec_with_invalid_shard_order 2025-12-04T15:25:15.1175235Z 2025-12-04T15:25:15.1175449Z Finished distributed/tensor/test_dtensor 1/3 ... [2025-12-04 15:25:15.116436][2267670.052681095], took 15.00min 2025-12-04T15:25:15.1176147Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:25:15.1187004Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:25:15.1189792Z Running distributed/test_cupy_as_tensor 1/1 ... [2025-12-04 15:25:15.118883][2267670.055133921] 2025-12-04T15:25:15.1190024Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:25:15.1192067Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/test_cupy_as_tensor.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 15:25:15.119097] 2025-12-04T15:25:20.3422123Z 2025-12-04T15:25:20.3423182Z distributed/test_cupy_as_tensor 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.test_cupy_as_tensor_1.1_7ddafe1ce24b90a1_.log 2025-12-04T15:25:20.3424014Z Running 1 items in this shard: test/distributed/test_cupy_as_tensor.py::CupyAsTensorTest::test_cupy_as_tensor 2025-12-04T15:25:20.3424338Z 2025-12-04T15:25:20.3424566Z Finished distributed/test_cupy_as_tensor 1/1 ... [2025-12-04 15:25:20.341931][2267675.27817509], took 0.09min 2025-12-04T15:25:20.3433288Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:25:20.3445244Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:25:20.3448475Z Running distributed/fsdp/test_fsdp_fx 1/1 ... [2025-12-04 15:25:20.344677][2267675.280928119] 2025-12-04T15:25:20.3448784Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:25:20.3450805Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/fsdp/test_fsdp_fx.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 15:25:20.344900] 2025-12-04T15:25:22.7133928Z 2025-12-04T15:25:22.7134962Z distributed/fsdp/test_fsdp_fx 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.fsdp.test_fsdp_fx_1.1_b156e31897ab54ec_.log 2025-12-04T15:25:22.7136322Z Running 1 items in this shard: test/distributed/fsdp/test_fsdp_fx.py::TestSymbolicTracingCUDA::test_symbolic_tracing_outputs_cuda 2025-12-04T15:25:22.7136725Z 2025-12-04T15:25:22.7136967Z Finished distributed/fsdp/test_fsdp_fx 1/1 ... 
[2025-12-04 15:25:22.713131][2267677.64937741], took 0.04min 2025-12-04T15:25:22.7142174Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:25:22.7155362Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:25:22.7157874Z Running distributed/_tools/test_sac_ilp 1/1 ... [2025-12-04 15:25:22.715702][2267677.651952363] 2025-12-04T15:25:22.7158209Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:25:22.7160455Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/_tools/test_sac_ilp.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 15:25:22.715927] 2025-12-04T15:25:27.0873487Z 2025-12-04T15:25:27.0874421Z distributed/_tools/test_sac_ilp 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed._tools.test_sac_ilp_1.1_4cd6e4df2aed90ea_.log 2025-12-04T15:25:27.0875289Z Running 4 items in this shard: test/distributed/_tools/test_sac_ilp.py::TestSACILP::test_sac_ilp_case1, test/distributed/_tools/test_sac_ilp.py::TestSACILP::test_sac_ilp_case2, test/distributed/_tools/test_sac_ilp.py::TestSACILP::test_sac_ilp_case3, test/distributed/_tools/test_sac_ilp.py::TestOptimalCheckpointingPolicy::test_get_optimial_checkpointing_policy_per_module 2025-12-04T15:25:27.0875952Z 2025-12-04T15:25:27.0876084Z Finished distributed/_tools/test_sac_ilp 1/1 ... 
[2025-12-04 15:25:27.087176][2267682.023419805], took 0.07min 2025-12-04T15:25:27.0881160Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:25:27.0892729Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:25:27.0895559Z Running distributed/checkpoint/test_hf_storage 1/1 ... [2025-12-04 15:25:27.089494][2267682.025744214] 2025-12-04T15:25:27.0895791Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:25:27.0897988Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/test_hf_storage.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 15:25:27.089704] 2025-12-04T15:25:29.4079178Z 2025-12-04T15:25:29.4080089Z distributed/checkpoint/test_hf_storage 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint.test_hf_storage_1.1_c28239992cd18a3b_.log 2025-12-04T15:25:29.4081113Z Running 5 items in this shard: test/distributed/checkpoint/test_hf_storage.py::TestHfStorage::test_read_data_hf, test/distributed/checkpoint/test_hf_storage.py::TestHfStorage::test_read_metadata_hf, test/distributed/checkpoint/test_hf_storage.py::TestHfStorage::test_write_data_hf, test/distributed/checkpoint/test_hf_storage.py::TestHfStorage::test_write_data_with_sharding, test/distributed/checkpoint/test_hf_storage.py::TestHfStorage::test_write_metadata_hf 2025-12-04T15:25:29.4081809Z 2025-12-04T15:25:29.4081954Z Finished distributed/checkpoint/test_hf_storage 1/1 ... 
[2025-12-04 15:25:29.407654][2267684.343900784], took 0.04min 2025-12-04T15:25:29.4086092Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:25:29.4099671Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:25:29.4101928Z Running distributed/pipelining/test_microbatch 1/1 ... [2025-12-04 15:25:29.409910][2267684.346159934] 2025-12-04T15:25:29.4102159Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:25:29.4102591Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/pipelining/test_microbatch.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 15:25:29.410136] 2025-12-04T15:26:12.0939187Z 2025-12-04T15:26:12.0940832Z distributed/pipelining/test_microbatch 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.pipelining.test_microbatch_1.1_d794789e34ace8cc_.log 2025-12-04T15:26:12.0942648Z Running 5 items in this shard: test/distributed/pipelining/test_microbatch.py::MicrobatchTestsCUDA::test_chunk_spec_cuda, test/distributed/pipelining/test_microbatch.py::MicrobatchTestsCUDA::test_split_and_merge_cuda, test/distributed/pipelining/test_microbatch.py::MicrobatchTestsCUDA::test_split_block_mask_batch_size_one_cuda, test/distributed/pipelining/test_microbatch.py::MicrobatchTestsCUDA::test_split_block_mask_cuda, test/distributed/pipelining/test_microbatch.py::MicrobatchTestsCUDA::test_split_block_mask_none_cuda 2025-12-04T15:26:12.0943976Z 2025-12-04T15:26:12.0944252Z Finished distributed/pipelining/test_microbatch 1/1 ... 
[2025-12-04 15:26:12.093569][2267727.029815357], took 0.71min 2025-12-04T15:26:12.0945017Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:26:12.0956173Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:26:12.0960765Z Running distributed/tensor/test_placement_types 1/1 ... [2025-12-04 15:26:12.095789][2267727.032039568] 2025-12-04T15:26:12.0961065Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:26:12.0962054Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/test_placement_types.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 15:26:12.096017] 2025-12-04T15:26:14.2139209Z 2025-12-04T15:26:14.2140467Z distributed/tensor/test_placement_types 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.tensor.test_placement_types_1.1_e066e777fe5adc64_.log 2025-12-04T15:26:14.2143014Z Running 5 items in this shard: test/distributed/tensor/test_placement_types.py::PlacementTypesTestCase::test_dynamo_can_identify_placement_classes, test/distributed/tensor/test_placement_types.py::PlacementTypesTestCase::test_equality, test/distributed/tensor/test_placement_types.py::PlacementTypesTestCase::test_strided_shard_isinstance_shard, test/distributed/tensor/test_placement_types.py::PlacementTypesTestCase::test_strided_shard_kwonly_argument, test/distributed/tensor/test_placement_types.py::PlacementTypesTestCase::test_type_identification 2025-12-04T15:26:14.2144874Z 2025-12-04T15:26:14.2145198Z Finished distributed/tensor/test_placement_types 1/1 ... 
[2025-12-04 15:26:14.213681][2267729.149926756], took 0.04min 2025-12-04T15:26:14.2150419Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:26:14.2162081Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:26:14.2164403Z Running distributed/tensor/test_dtensor_dispatch_overhead 1/1 ... [2025-12-04 15:26:14.216329][2267729.152579577] 2025-12-04T15:26:14.2164775Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:26:14.2167064Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/test_dtensor_dispatch_overhead.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 15:26:14.216546] 2025-12-04T15:29:24.4178348Z 2025-12-04T15:29:24.4179783Z distributed/tensor/test_dtensor_dispatch_overhead 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.tensor.test_dtensor_dispatch_overhead_1.1_a570002797d2bfa6_.log 2025-12-04T15:29:24.4182025Z Running 1 items in this shard: test/distributed/tensor/test_dtensor_dispatch_overhead.py::DistOpDispatchOverHead::test_dtensor_add_op_dispatch_overhead 2025-12-04T15:29:24.4182800Z 2025-12-04T15:29:24.4183334Z Finished distributed/tensor/test_dtensor_dispatch_overhead 1/1 ... 
[2025-12-04 15:29:24.417375][2267919.353623527], took 3.17min 2025-12-04T15:29:24.4184631Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:29:24.4191828Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:29:24.4192284Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T15:29:24.4192628Z Uploading artifacts took 0.00 seconds 2025-12-04T15:29:24.4193668Z Running distributed/checkpoint/_experimental/test_checkpoint_reader 1/1 ... [2025-12-04 15:29:24.419226][2267919.355477494] 2025-12-04T15:29:24.4193953Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:29:24.4195318Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/_experimental/test_checkpoint_reader.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 15:29:24.419421] 2025-12-04T15:29:26.8372350Z 2025-12-04T15:29:26.8373859Z distributed/checkpoint/_experimental/test_checkpoint_reader 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint._experimental.test_checkpoint_reader_1.1_74914e620ccd8ae1_.log 2025-12-04T15:29:26.8378499Z Running 7 items in this shard: test/distributed/checkpoint/_experimental/test_checkpoint_reader.py::TestCheckpointReader::test_partial_read, test/distributed/checkpoint/_experimental/test_checkpoint_reader.py::TestCheckpointReader::test_partial_read_different_dtypes, test/distributed/checkpoint/_experimental/test_checkpoint_reader.py::TestCheckpointReader::test_partial_read_missing_keys, test/distributed/checkpoint/_experimental/test_checkpoint_reader.py::TestCheckpointReader::test_read_checkpoint, test/distributed/checkpoint/_experimental/test_checkpoint_reader.py::TestCheckpointReader::test_read_nonexistent_checkpoint, test/distributed/checkpoint/_experimental/test_checkpoint_reader.py::TestCheckpointReader::test_read_with_kwargs, test/distributed/checkpoint/_experimental/test_checkpoint_reader.py::TestCheckpointReader::test_read_with_map_location 2025-12-04T15:29:26.8381202Z 2025-12-04T15:29:26.8381529Z Finished distributed/checkpoint/_experimental/test_checkpoint_reader 1/1 ... [2025-12-04 15:29:26.836837][2267921.773084611], took 0.04min 2025-12-04T15:29:26.8382426Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:29:26.8386116Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:29:26.8388187Z Running distributed/checkpoint/test_format_utils 1/1 ... 
[2025-12-04 15:29:26.838701][2267921.774951868] 2025-12-04T15:29:26.8388593Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:29:26.8390280Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/test_format_utils.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 15:29:26.838905] 2025-12-04T15:31:59.8376423Z 2025-12-04T15:31:59.8377406Z distributed/checkpoint/test_format_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint.test_format_utils_1.1_5ac5c246a881b84d_.log 2025-12-04T15:31:59.8378817Z Running 3 items in this shard: test/distributed/checkpoint/test_format_utils.py::TestFormatUtils::test_dcp_to_torch_save, test/distributed/checkpoint/test_format_utils.py::TestFormatUtils::test_online_torch_save_to_dcp, test/distributed/checkpoint/test_format_utils.py::TestFormatUtils::test_torch_save_to_dcp 2025-12-04T15:31:59.8379848Z 2025-12-04T15:31:59.8380192Z Finished distributed/checkpoint/test_format_utils 1/1 ... [2025-12-04 15:31:59.837423][2268074.773670234], took 2.55min 2025-12-04T15:31:59.8385474Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T15:31:59.8395573Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T15:31:59.8398140Z Running distributed/test_aten_comm_compute_reordering 1/3 ... 
[2025-12-04 15:31:59.839717][2268074.775967554] 2025-12-04T15:31:59.8398490Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T15:31:59.8400488Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/test_aten_comm_compute_reordering.py', '--shard-id=1', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 15:31:59.839927] 2025-12-04T16:10:08.2969176Z 2025-12-04T16:10:08.2970090Z PRINTING LOG FILE of distributed/test_aten_comm_compute_reordering 1/3 (test/test-reports/distributed.test_aten_comm_compute_reordering_1.3_45888d641466d241_.log) 2025-12-04T16:10:08.2971045Z Test results will be stored in test-reports/python-pytest/distributed.test_aten_comm_compute_reordering/distributed.test_aten_comm_compute_reordering-fd5edab33c05981b.xml 2025-12-04T16:10:08.2971671Z ============================= test session starts ============================== 2025-12-04T16:10:08.2972096Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T16:10:08.2972460Z cachedir: .pytest_cache 2025-12-04T16:10:08.2972894Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T16:10:08.2973361Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T16:10:08.2973586Z configfile: pytest.ini 2025-12-04T16:10:08.2974036Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T16:10:08.2974499Z collecting ... 
collected 48 items 2025-12-04T16:10:08.2974764Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T16:10:08.2981772Z Running 19 items in this shard: test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingMultiProc::test_custom_estimator_for_non_compute_nodes, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingMultiProc::test_inductor_default_comms_ordering, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingMultiProc::test_sink_waits, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingMultiProc::test_sink_waits_raise_comms, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_bucket_exposed_with_hidden_single_overlap, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_bucketing_split_for_overlap, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_bucketing_split_for_overlap_blocking_deps_inductor, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_bucketing_split_for_overlap_blocking_no_deps, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_collective_benchmarking_with_real_pg, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_grouped_scheduler_node, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_inductor_default_comms_ordering, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_multiple_hiding_nodes_bucketing, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_overlap_scheduling_via_config, test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_sink_waits, 
test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_sink_waits_raise_comms, test/distributed/test_aten_comm_compute_reordering.py::TestManualOverlapBucketing::test_inductor_default_comms_ordering, test/distributed/test_aten_comm_compute_reordering.py::TestManualOverlapBucketing::test_make_graph_view_and_get_subgraph_by_path, test/distributed/test_aten_comm_compute_reordering.py::TestManualOverlapBucketing::test_reorder_compute_for_overlap_mul, test/distributed/test_aten_comm_compute_reordering.py::TestManualOverlapBucketing::test_sink_waits 2025-12-04T16:10:08.2987083Z 2025-12-04T16:10:08.2987679Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingMultiProc::test_custom_estimator_for_non_compute_nodes I1204 15:32:04.723000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 184305 2025-12-04T16:10:08.2988368Z I1204 15:32:04.723000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 184306 2025-12-04T16:10:08.2988891Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.2989256Z warn_once( 2025-12-04T16:10:08.2989590Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.2989944Z warn_once( 2025-12-04T16:10:08.2990471Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T16:10:08.2991015Z warnings.warn( 2025-12-04T16:10:08.2991543Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.2992077Z warnings.warn( 2025-12-04T16:10:08.2992210Z PASSED [11.5202s] [ 5%] 2025-12-04T16:10:08.2992722Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingMultiProc::test_inductor_default_comms_ordering I1204 15:32:16.246000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 186027 2025-12-04T16:10:08.2993407Z I1204 15:32:16.247000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 186028 2025-12-04T16:10:08.2993914Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.2994274Z warn_once( 2025-12-04T16:10:08.2994601Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.2994954Z warn_once( 2025-12-04T16:10:08.2995123Z PASSED [8.0141s] [ 10%] 2025-12-04T16:10:08.2995503Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingMultiProc::test_sink_waits I1204 15:32:24.262000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 187763 2025-12-04T16:10:08.2996037Z I1204 15:32:24.263000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 187764 2025-12-04T16:10:08.2996441Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by 
torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.2996741Z warn_once( 2025-12-04T16:10:08.2997000Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.2997313Z warn_once( 2025-12-04T16:10:08.2997760Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.2998194Z warnings.warn( 2025-12-04T16:10:08.2998617Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.2999049Z warnings.warn( 2025-12-04T16:10:08.2999148Z PASSED [119.7503s] [ 15%] 2025-12-04T16:10:08.2999545Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingMultiProc::test_sink_waits_raise_comms I1204 15:34:24.014000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 189413 2025-12-04T16:10:08.3000063Z I1204 15:34:24.014000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 189414 2025-12-04T16:10:08.3000467Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3000749Z warn_once( 2025-12-04T16:10:08.3001010Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3001292Z warn_once( 2025-12-04T16:10:08.3001701Z 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3002136Z warnings.warn( 2025-12-04T16:10:08.3002552Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3002985Z warnings.warn( 2025-12-04T16:10:08.3003082Z PASSED [162.0051s] [ 21%] 2025-12-04T16:10:08.3003502Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_bucket_exposed_with_hidden_single_overlap I1204 15:37:06.020000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 191099 2025-12-04T16:10:08.3004045Z I1204 15:37:06.021000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 191100 2025-12-04T16:10:08.3004451Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3004732Z warn_once( 2025-12-04T16:10:08.3004987Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3005262Z warn_once( 2025-12-04T16:10:08.3005679Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T16:10:08.3006105Z warnings.warn( 2025-12-04T16:10:08.3006514Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3006969Z warnings.warn( 2025-12-04T16:10:08.3007376Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:869: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3007862Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3008297Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:872: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3008722Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3009149Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:873: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3009574Z ag3 = _functional_collectives.all_gather_tensor(c, 0, ranks) 2025-12-04T16:10:08.3009998Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:869: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 
2025-12-04T16:10:08.3010421Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3010845Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:872: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3011269Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3011698Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:873: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3012123Z ag3 = _functional_collectives.all_gather_tensor(c, 0, ranks) 2025-12-04T16:10:08.3012267Z PASSED [262.5454s] [ 26%] 2025-12-04T16:10:08.3012660Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_bucketing_split_for_overlap I1204 15:41:28.568000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 192455 2025-12-04T16:10:08.3013177Z I1204 15:41:28.568000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 192456 2025-12-04T16:10:08.3013577Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3013856Z warn_once( 2025-12-04T16:10:08.3014113Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3014388Z warn_once( 2025-12-04T16:10:08.3014791Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not 
enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3015215Z warnings.warn( 2025-12-04T16:10:08.3015638Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3016061Z warnings.warn( 2025-12-04T16:10:08.3016491Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:800: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3016935Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3017360Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:800: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3017828Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3018250Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:801: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3018673Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3019095Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:802: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 
2025-12-04T16:10:08.3019526Z ag3 = _functional_collectives.all_gather_tensor(c[:4], 0, ranks) 2025-12-04T16:10:08.3019958Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:801: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3020379Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3020799Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:803: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3021228Z ag4 = _functional_collectives.all_gather_tensor(d[:4], 0, ranks) 2025-12-04T16:10:08.3021657Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:802: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3022082Z ag3 = _functional_collectives.all_gather_tensor(c[:4], 0, ranks) 2025-12-04T16:10:08.3022509Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:803: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 
2025-12-04T16:10:08.3022937Z ag4 = _functional_collectives.all_gather_tensor(d[:4], 0, ranks) 2025-12-04T16:10:08.3023086Z PASSED [176.8331s] [ 31%] 2025-12-04T16:10:08.3023510Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_bucketing_split_for_overlap_blocking_deps_inductor I1204 15:44:25.402000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 194197 2025-12-04T16:10:08.3024056Z I1204 15:44:25.402000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 194198 2025-12-04T16:10:08.3024455Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3024733Z warn_once( 2025-12-04T16:10:08.3025010Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3025289Z warn_once( 2025-12-04T16:10:08.3025691Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3026143Z warnings.warn( 2025-12-04T16:10:08.3026564Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3026992Z warnings.warn( 2025-12-04T16:10:08.3027379Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:921: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. 
Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3027845Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3028273Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:922: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3028698Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3029120Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:923: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3029548Z ag3 = _functional_collectives.all_gather_tensor(c[:4], 0, ranks) 2025-12-04T16:10:08.3029978Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:924: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3030409Z ag4 = _functional_collectives.all_gather_tensor(d[:4], 0, ranks) 2025-12-04T16:10:08.3030837Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:921: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3031260Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3031681Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:922: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 
2025-12-04T16:10:08.3032102Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3032528Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:923: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3032956Z ag3 = _functional_collectives.all_gather_tensor(c[:4], 0, ranks) 2025-12-04T16:10:08.3033384Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:924: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3033809Z ag4 = _functional_collectives.all_gather_tensor(d[:4], 0, ranks) 2025-12-04T16:10:08.3033957Z PASSED [227.7849s] [ 36%] 2025-12-04T16:10:08.3034394Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_bucketing_split_for_overlap_blocking_no_deps I1204 15:48:13.189000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 195903 2025-12-04T16:10:08.3034930Z I1204 15:48:13.189000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 195904 2025-12-04T16:10:08.3035344Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3035637Z warn_once( 2025-12-04T16:10:08.3035907Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3036184Z warn_once( 2025-12-04T16:10:08.3036594Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication 
available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3037018Z warnings.warn( 2025-12-04T16:10:08.3037424Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3037909Z warnings.warn( 2025-12-04T16:10:08.3038290Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:740: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3038716Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3039143Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:741: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3039565Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3039989Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:742: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3040419Z ag3 = _functional_collectives.all_gather_tensor(c[:4], 0, ranks) 2025-12-04T16:10:08.3040851Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:743: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 
2025-12-04T16:10:08.3041279Z ag4 = _functional_collectives.all_gather_tensor(d[:4], 0, ranks) 2025-12-04T16:10:08.3041705Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:740: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3042125Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3042547Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:741: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3042971Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3043392Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:742: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3043817Z ag3 = _functional_collectives.all_gather_tensor(c[:4], 0, ranks) 2025-12-04T16:10:08.3044258Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:743: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 
2025-12-04T16:10:08.3044682Z ag4 = _functional_collectives.all_gather_tensor(d[:4], 0, ranks) 2025-12-04T16:10:08.3044843Z PASSED [190.5442s] [ 42%] 2025-12-04T16:10:08.3045265Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_collective_benchmarking_with_real_pg I1204 15:51:23.734000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 197609 2025-12-04T16:10:08.3045831Z I1204 15:51:23.734000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 197610 2025-12-04T16:10:08.3046228Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3046509Z warn_once( 2025-12-04T16:10:08.3046766Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3047042Z warn_once( 2025-12-04T16:10:08.3047449Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3047907Z warnings.warn( 2025-12-04T16:10:08.3048312Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3048735Z warnings.warn( 2025-12-04T16:10:08.3049276Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/comm_analysis.py:454: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. 
If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`. 2025-12-04T16:10:08.3049837Z device = _get_pg_default_device(pg) 2025-12-04T16:10:08.3050411Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2771: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`. 2025-12-04T16:10:08.3050993Z device = device or _get_pg_default_device(group) 2025-12-04T16:10:08.3051565Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/comm_analysis.py:454: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`. 2025-12-04T16:10:08.3052120Z device = _get_pg_default_device(pg) 2025-12-04T16:10:08.3052689Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2771: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`. 2025-12-04T16:10:08.3053267Z device = device or _get_pg_default_device(group) 2025-12-04T16:10:08.3053665Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning. 
2025-12-04T16:10:08.3054034Z return func(*args, **kwargs) 2025-12-04T16:10:08.3054438Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:1054: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3054932Z ag = _functional_collectives.all_gather_tensor( 2025-12-04T16:10:08.3055347Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:1054: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3055760Z ag = _functional_collectives.all_gather_tensor( 2025-12-04T16:10:08.3055891Z PASSED [144.5963s] [ 47%] 2025-12-04T16:10:08.3056181Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_grouped_scheduler_node SKIPPED [0.0003s] (Logic not yet implemented) [ 52%] 2025-12-04T16:10:08.3056770Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_inductor_default_comms_ordering I1204 15:53:48.332000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 199301 2025-12-04T16:10:08.3057293Z I1204 15:53:48.333000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 199302 2025-12-04T16:10:08.3057731Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3058013Z warn_once( 2025-12-04T16:10:08.3058272Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3058549Z warn_once( 2025-12-04T16:10:08.3058641Z PASSED [7.9132s] [ 57%] 
2025-12-04T16:10:08.3059036Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_multiple_hiding_nodes_bucketing I1204 15:53:56.247000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 201037 2025-12-04T16:10:08.3059562Z I1204 15:53:56.248000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 201038 2025-12-04T16:10:08.3059962Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3060244Z warn_once( 2025-12-04T16:10:08.3060501Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3060778Z warn_once( 2025-12-04T16:10:08.3061183Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3061610Z warnings.warn( 2025-12-04T16:10:08.3062020Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3062447Z warnings.warn( 2025-12-04T16:10:08.3062837Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:1175: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 
2025-12-04T16:10:08.3063273Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3063728Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:1176: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3064171Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3064616Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:1175: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3065057Z ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks) 2025-12-04T16:10:08.3065482Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:1176: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 
2025-12-04T16:10:08.3065909Z ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks) 2025-12-04T16:10:08.3066055Z PASSED [185.3519s] [ 63%] 2025-12-04T16:10:08.3066451Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_overlap_scheduling_via_config I1204 15:57:01.601000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 202715 2025-12-04T16:10:08.3066972Z I1204 15:57:01.602000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 202716 2025-12-04T16:10:08.3067371Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3067704Z warn_once( 2025-12-04T16:10:08.3067961Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3068238Z warn_once( 2025-12-04T16:10:08.3068646Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3069072Z warnings.warn( 2025-12-04T16:10:08.3069483Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T16:10:08.3069913Z warnings.warn( 2025-12-04T16:10:08.3070009Z PASSED [155.1058s] [ 68%] 2025-12-04T16:10:08.3070377Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_sink_waits I1204 15:59:36.709000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 204365 2025-12-04T16:10:08.3070868Z I1204 15:59:36.709000 183649 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 204366 2025-12-04T16:10:08.3071266Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3071546Z warn_once( 2025-12-04T16:10:08.3071805Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3072082Z warn_once( 2025-12-04T16:10:08.3072486Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3072910Z warnings.warn( 2025-12-04T16:10:08.3073340Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T16:10:08.3073800Z warnings.warn( 2025-12-04T16:10:08.3073916Z Exception in thread Thread-1 (_event_listener): 2025-12-04T16:10:08.3074196Z Traceback (most recent call last): 2025-12-04T16:10:08.3074405Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T16:10:08.3074608Z Exception in thread Thread-1 (_event_listener): 2025-12-04T16:10:08.3074748Z Traceback (most recent call last): 2025-12-04T16:10:08.3074930Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner 2025-12-04T16:10:08.3075106Z self.run() 2025-12-04T16:10:08.3075247Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T16:10:08.3075424Z self._target(*self._args, **self._kwargs) 2025-12-04T16:10:08.3075684Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 879, in _event_listener 2025-12-04T16:10:08.3075939Z event = parent_pipe.recv() 2025-12-04T16:10:08.3076126Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv 2025-12-04T16:10:08.3076313Z self.run() 2025-12-04T16:10:08.3076455Z File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953, in run 2025-12-04T16:10:08.3076621Z buf = self._recv_bytes() 2025-12-04T16:10:08.3076814Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes 2025-12-04T16:10:08.3077012Z buf = self._recv(4) 2025-12-04T16:10:08.3077187Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 383, in _recv 2025-12-04T16:10:08.3077387Z self._target(*self._args, **self._kwargs) 2025-12-04T16:10:08.3077685Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 879, in _event_listener 2025-12-04T16:10:08.3077931Z raise EOFError 2025-12-04T16:10:08.3078024Z EOFError 2025-12-04T16:10:08.3078118Z event = 
parent_pipe.recv() 2025-12-04T16:10:08.3078301Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 250, in recv 2025-12-04T16:10:08.3078494Z buf = self._recv_bytes() 2025-12-04T16:10:08.3078686Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes 2025-12-04T16:10:08.3078882Z buf = self._recv(4) 2025-12-04T16:10:08.3079057Z File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/connection.py", line 383, in _recv 2025-12-04T16:10:08.3079241Z raise EOFError 2025-12-04T16:10:08.3079332Z EOFError 2025-12-04T16:10:08.3079383Z 2025-12-04T16:10:08.3079385Z 2025-12-04T16:10:08.3079663Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_aten_comm_compute_reordering/distributed.test_aten_comm_compute_reordering-fd5edab33c05981b.xml - 2025-12-04T16:10:08.3080035Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T16:10:08.3080298Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py:1036: KeyboardInterrupt 2025-12-04T16:10:08.3080572Z (to show a full traceback on KeyboardInterrupt use --full-trace) 2025-12-04T16:10:08.3080754Z ================== 12 passed, 1 skipped in 1795.26s (0:29:55) ================== 2025-12-04T16:10:08.3080905Z Command took >30min, returning 124 2025-12-04T16:10:08.3081020Z Got exit code 124 2025-12-04T16:10:08.3081119Z Retrying single test... 
2025-12-04T16:10:08.3081408Z Test results will be stored in test-reports/python-pytest/distributed.test_aten_comm_compute_reordering/distributed.test_aten_comm_compute_reordering-0165ca678555aafd.xml 2025-12-04T16:10:08.3081745Z ============================= test session starts ============================== 2025-12-04T16:10:08.3081955Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T16:10:08.3082143Z cachedir: .pytest_cache 2025-12-04T16:10:08.3082370Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T16:10:08.3082623Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T16:10:08.3082758Z configfile: pytest.ini 2025-12-04T16:10:08.3083002Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T16:10:08.3083276Z collecting ... collected 48 items / 18 deselected / 30 selected 2025-12-04T16:10:08.3083605Z stepcurrent: skipping 13 already run items. 
Running only test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_sink_waits 2025-12-04T16:10:08.3083896Z Running 1 items in this shard 2025-12-04T16:10:08.3083971Z 2025-12-04T16:10:08.3084275Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_sink_waits I1204 16:02:10.049000 206011 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 206667 2025-12-04T16:10:08.3084766Z I1204 16:02:10.050000 206011 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 206668 2025-12-04T16:10:08.3085170Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3085452Z warn_once( 2025-12-04T16:10:08.3085709Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3085986Z warn_once( 2025-12-04T16:10:08.3086398Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3086825Z warnings.warn( 2025-12-04T16:10:08.3087236Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T16:10:08.3087702Z warnings.warn( 2025-12-04T16:10:08.3087799Z PASSED [11.0203s] [100%] 2025-12-04T16:10:08.3087864Z 2025-12-04T16:10:08.3088135Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_aten_comm_compute_reordering/distributed.test_aten_comm_compute_reordering-0165ca678555aafd.xml - 2025-12-04T16:10:08.3088502Z ====================== 1 passed, 18 deselected in 11.03s ======================= 2025-12-04T16:10:08.3088641Z Got exit code 0 2025-12-04T16:10:08.3088778Z Test succeeded in new process, continuing with the rest of the tests 2025-12-04T16:10:08.3089117Z Test results will be stored in test-reports/python-pytest/distributed.test_aten_comm_compute_reordering/distributed.test_aten_comm_compute_reordering-38c14026fb291828.xml 2025-12-04T16:10:08.3089431Z ============================= test session starts ============================== 2025-12-04T16:10:08.3089640Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T16:10:08.3089828Z cachedir: .pytest_cache 2025-12-04T16:10:08.3090048Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T16:10:08.3090286Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T16:10:08.3090403Z configfile: pytest.ini 2025-12-04T16:10:08.3090655Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T16:10:08.3090930Z collecting ... collected 48 items / 14 deselected / 34 selected 2025-12-04T16:10:08.3091093Z stepcurrent: skipping 14 already run items. 
2025-12-04T16:10:08.3091223Z Running 5 items in this shard 2025-12-04T16:10:08.3091307Z 2025-12-04T16:10:08.3091648Z distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_sink_waits_raise_comms I1204 16:02:26.676000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 208977 2025-12-04T16:10:08.3092171Z I1204 16:02:26.677000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 208978 2025-12-04T16:10:08.3092569Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3092848Z warn_once( 2025-12-04T16:10:08.3093105Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3093382Z warn_once( 2025-12-04T16:10:08.3093789Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3094217Z warnings.warn( 2025-12-04T16:10:08.3094629Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T16:10:08.3095053Z warnings.warn( 2025-12-04T16:10:08.3095148Z PASSED [151.2027s] [ 20%] 2025-12-04T16:10:08.3095532Z distributed/test_aten_comm_compute_reordering.py::TestManualOverlapBucketing::test_inductor_default_comms_ordering I1204 16:04:57.882000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 210663 2025-12-04T16:10:08.3096035Z I1204 16:04:57.882000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 210664 2025-12-04T16:10:08.3096432Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3096713Z warn_once( 2025-12-04T16:10:08.3096969Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3097244Z warn_once( 2025-12-04T16:10:08.3097334Z PASSED [7.9145s] [ 40%] 2025-12-04T16:10:08.3097756Z distributed/test_aten_comm_compute_reordering.py::TestManualOverlapBucketing::test_make_graph_view_and_get_subgraph_by_path I1204 16:05:05.798000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 212399 2025-12-04T16:10:08.3098266Z I1204 16:05:05.799000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 212400 2025-12-04T16:10:08.3098495Z PASSED [7.0135s] [ 60%] 2025-12-04T16:10:08.3098882Z distributed/test_aten_comm_compute_reordering.py::TestManualOverlapBucketing::test_reorder_compute_for_overlap_mul I1204 16:05:12.813000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 213713 2025-12-04T16:10:08.3099383Z I1204 16:05:12.814000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 213714 2025-12-04T16:10:08.3099778Z 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3100057Z warn_once( 2025-12-04T16:10:08.3100327Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3100605Z warn_once( 2025-12-04T16:10:08.3101013Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3101467Z warnings.warn( 2025-12-04T16:10:08.3101887Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3102311Z warnings.warn( 2025-12-04T16:10:08.3102701Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:288: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3103135Z ar = _functional_collectives.all_reduce(a, "sum", ranks, tag) 2025-12-04T16:10:08.3103564Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:288: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 
2025-12-04T16:10:08.3103993Z ar = _functional_collectives.all_reduce(a, "sum", ranks, tag) 2025-12-04T16:10:08.3104419Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:293: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3104844Z fr = _functional_collectives.all_reduce(f, "sum", ranks, tag) 2025-12-04T16:10:08.3105269Z /var/lib/jenkins/pytorch/test/distributed/test_aten_comm_compute_reordering.py:293: FutureWarning: The combination of ranks + tag as process group identifier has been deprecated. Please switch to using ProcessGroup, DeviceMesh, or group name instead. 2025-12-04T16:10:08.3105693Z fr = _functional_collectives.all_reduce(f, "sum", ranks, tag) 2025-12-04T16:10:08.3105840Z PASSED [80.6130s] [ 80%] 2025-12-04T16:10:08.3106197Z distributed/test_aten_comm_compute_reordering.py::TestManualOverlapBucketing::test_sink_waits I1204 16:06:33.428000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 215435 2025-12-04T16:10:08.3106672Z I1204 16:06:33.429000 208321 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 215436 2025-12-04T16:10:08.3107070Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3107350Z warn_once( 2025-12-04T16:10:08.3107664Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/pgo.py:539: UserWarning: dynamo_pgo force disabled by torch.compiler.config.force_disable_caches 2025-12-04T16:10:08.3107940Z warn_once( 2025-12-04T16:10:08.3108345Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. 
Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3108772Z warnings.warn( 2025-12-04T16:10:08.3109179Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T16:10:08.3109606Z warnings.warn( 2025-12-04T16:10:08.3109701Z PASSED [213.6778s] [100%] 2025-12-04T16:10:08.3109767Z 2025-12-04T16:10:08.3110057Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_aten_comm_compute_reordering/distributed.test_aten_comm_compute_reordering-38c14026fb291828.xml - 2025-12-04T16:10:08.3110427Z ================= 5 passed, 14 deselected in 460.44s (0:07:40) ================= 2025-12-04T16:10:08.3110800Z The following tests failed and then succeeded when run in a new process['test/distributed/test_aten_comm_compute_reordering.py::TestComputeCommReorderingBucketing::test_sink_waits'] 2025-12-04T16:10:08.3111082Z 2025-12-04T16:10:08.3111300Z FINISHED PRINTING LOG FILE of distributed/test_aten_comm_compute_reordering 1/3 (test/test-reports/distributed.test_aten_comm_compute_reordering_1.3_45888d641466d241_.log) 2025-12-04T16:10:08.3111547Z 2025-12-04T16:10:08.3111690Z Finished distributed/test_aten_comm_compute_reordering 1/3 ... 
[2025-12-04 16:10:08.296867][2270363.233112401], took 38.14min 2025-12-04T16:10:08.3112132Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:10:08.3112525Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:10:08.3112744Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T16:10:08.3112924Z Uploading artifacts took 0.00 seconds 2025-12-04T16:10:08.3113099Z Running distributed/test_p2p_ipc 1/1 ... [2025-12-04 16:10:08.299354][2270363.235605097] 2025-12-04T16:10:08.3113281Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:10:08.3113662Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/test_p2p_ipc.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 16:10:08.299561] 2025-12-04T16:10:13.0217000Z 2025-12-04T16:10:13.0217871Z distributed/test_p2p_ipc 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.test_p2p_ipc_1.1_38d71492eee292e0_.log 2025-12-04T16:10:13.0218300Z Running 1 items in this shard: test/distributed/test_p2p_ipc.py::P2PIpcTest::test_p2p_ipc 2025-12-04T16:10:13.0218449Z 2025-12-04T16:10:13.0218575Z Finished distributed/test_p2p_ipc 1/1 ... [2025-12-04 16:10:13.021427][2270367.957675468], took 0.08min 2025-12-04T16:10:13.0223427Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:10:13.0233623Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:10:13.0234386Z Running distributed/tensor/test_common_rules 1/1 ... 
[2025-12-04 16:10:13.023318][2270367.959569799] 2025-12-04T16:10:13.0234603Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:10:13.0236732Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/test_common_rules.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 16:10:13.023514] 2025-12-04T16:10:17.5456217Z 2025-12-04T16:10:17.5457254Z distributed/tensor/test_common_rules 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.tensor.test_common_rules_1.1_a5301b93dba34be0_.log 2025-12-04T16:10:17.5461972Z Running 10 items in this shard: test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_einop_basic_propagation, test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_einop_errors, test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_einop_linearity, test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_einop_merge_sharding, test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_einop_multi_sharding_on_mesh_dim, test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_einop_pointwise_propagation, test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_pointwise_enforce_sharding_multi_sharding_on_mesh_dim, test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_pointwise_multi_sharding_on_mesh_dim, test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_pointwise_rules_broadcasting, test/distributed/tensor/test_common_rules.py::CommonRulesTest::test_pointwise_rules_suggestion 2025-12-04T16:10:17.5465101Z 2025-12-04T16:10:17.5465429Z Finished distributed/tensor/test_common_rules 1/1 ... 
[2025-12-04 16:10:17.545332][2270372.481578202], took 0.08min 2025-12-04T16:10:17.5466936Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:10:17.5478972Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:10:17.5481072Z Running distributed/checkpoint/test_hf_safetensor_e2e 1/1 ... [2025-12-04 16:10:17.548007][2270372.484258553] 2025-12-04T16:10:17.5481410Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:10:17.5483609Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/test_hf_safetensor_e2e.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 16:10:17.548227] 2025-12-04T16:27:23.3222658Z 2025-12-04T16:27:23.3223339Z distributed/checkpoint/test_hf_safetensor_e2e 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint.test_hf_safetensor_e2e_1.1_993bdebacf7c10b6_.log 2025-12-04T16:27:23.3227188Z Running 11 items in this shard: test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestSingleRankSaveLoad::test_load, test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestSingleRankSaveLoad::test_load_into_empty_dict, test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestSingleRankSaveLoad::test_load_with_multiple_threads, test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestSingleRankSaveLoad::test_quantized_checkpoint_loading, test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestSingleRankSaveLoad::test_save, test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestDistributedHFSafetensorsConsolidation::test_consolidate_to_one_file, 
test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestDTensorReshardPlacementChange::test_1d_to_1d_reshard_placement_change, test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestDTensorReshardPlacementChange::test_2d_to_2d_reshard_placement_change, test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestDTensorReshardMeshChange::test_1d_to_2d_reshard_mesh_change, test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestDTensorReshardMeshChange::test_2d_to_1d_reshard_mesh_change, test/distributed/checkpoint/test_hf_safetensor_e2e.py::TestDTensorReshardMeshChange::test_dtensor_checkpoint_resharding_with_empty_shard 2025-12-04T16:27:23.3229186Z 2025-12-04T16:27:23.3229352Z Finished distributed/checkpoint/test_hf_safetensor_e2e 1/1 ... [2025-12-04 16:27:23.321967][2271398.258213257], took 17.10min 2025-12-04T16:27:23.3231708Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:27:23.3243026Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:27:23.3245067Z Running distributed/_tools/test_sac_estimator 1/1 ... [2025-12-04 16:27:23.324415][2271398.260666095] 2025-12-04T16:27:23.3245507Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:27:23.3248145Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/_tools/test_sac_estimator.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 16:27:23.324613] 2025-12-04T16:27:38.4615444Z 2025-12-04T16:27:38.4616270Z distributed/_tools/test_sac_estimator 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed._tools.test_sac_estimator_1.1_6715ddd0ee930000_.log 2025-12-04T16:27:38.4618012Z Running 2 items in this shard: test/distributed/_tools/test_sac_estimator.py::TestSACEstimator::test_simple_model_sac_estimation, test/distributed/_tools/test_sac_estimator.py::TestSACEstimator::test_transformer_sac_estimation 2025-12-04T16:27:38.4618846Z 2025-12-04T16:27:38.4619095Z Finished distributed/_tools/test_sac_estimator 1/1 ... [2025-12-04 16:27:38.461268][2271413.397515132], took 0.25min 2025-12-04T16:27:38.4625565Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:27:38.4638266Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:27:38.4639721Z Running distributed/_tools/test_memory_tracker 1/1 ... [2025-12-04 16:27:38.463776][2271413.400027378] 2025-12-04T16:27:38.4640034Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:27:38.4641457Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/_tools/test_memory_tracker.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 16:27:38.463984] 2025-12-04T16:27:44.4878656Z 2025-12-04T16:27:44.4879965Z distributed/_tools/test_memory_tracker 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed._tools.test_memory_tracker_1.1_c0143a5973912ff9_.log 2025-12-04T16:27:44.4881107Z Running 1 items in this shard: test/distributed/_tools/test_memory_tracker.py::TestMemoryTracker::test_local_model 2025-12-04T16:27:44.4881531Z 2025-12-04T16:27:44.4881860Z Finished distributed/_tools/test_memory_tracker 1/1 ... [2025-12-04 16:27:44.487562][2271419.423808115], took 0.10min 2025-12-04T16:27:44.4890212Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:27:44.4900867Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:27:44.4902953Z Running distributed/checkpoint/_experimental/test_builder 1/1 ... [2025-12-04 16:27:44.490197][2271419.426448208] 2025-12-04T16:27:44.4903368Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:27:44.4905497Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/_experimental/test_builder.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 16:27:44.490398] 2025-12-04T16:27:48.6123790Z 2025-12-04T16:27:48.6125274Z distributed/checkpoint/_experimental/test_builder 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint._experimental.test_builder_1.1_e0ec276ec6c4e0af_.log 2025-12-04T16:27:48.6128415Z Running 4 items in this shard: test/distributed/checkpoint/_experimental/test_builder.py::TestMakeCheckpointer::test_make_async_checkpointer, test/distributed/checkpoint/_experimental/test_builder.py::TestMakeCheckpointer::test_make_sync_checkpointer, test/distributed/checkpoint/_experimental/test_builder.py::TestMakeCheckpointer::test_make_sync_checkpointer_with_config_first, test/distributed/checkpoint/_experimental/test_builder.py::TestMakeCheckpointer::test_make_sync_checkpointer_with_custom_config 2025-12-04T16:27:48.6130439Z 2025-12-04T16:27:48.6130712Z Finished distributed/checkpoint/_experimental/test_builder 1/1 ... [2025-12-04 16:27:48.611965][2271423.548211823], took 0.07min 2025-12-04T16:27:48.6134931Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:27:48.6146720Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:27:48.6147904Z Running distributed/_composable/test_replicate_with_fsdp 1/1 ... [2025-12-04 16:27:48.614623][2271423.550874455] 2025-12-04T16:27:48.6148318Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:27:48.6150719Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/_composable/test_replicate_with_fsdp.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 16:27:48.614847] 2025-12-04T16:49:50.5049109Z 2025-12-04T16:49:50.5050315Z PRINTING LOG FILE of distributed/_composable/test_replicate_with_fsdp 1/1 (test/test-reports/distributed._composable.test_replicate_with_fsdp_1.1_f5ca0749d20b4f7e_.log) 2025-12-04T16:49:50.5051241Z Test results will be stored in test-reports/python-pytest/distributed._composable.test_replicate_with_fsdp/distributed._composable.test_replicate_with_fsdp-bc194f5ff6fd8eb4.xml 2025-12-04T16:49:50.5051797Z ============================= test session starts ============================== 2025-12-04T16:49:50.5052175Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T16:49:50.5052489Z cachedir: .pytest_cache 2025-12-04T16:49:50.5052876Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T16:49:50.5053267Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T16:49:50.5053474Z configfile: pytest.ini 2025-12-04T16:49:50.5053852Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T16:49:50.5054257Z collecting ... 
collected 5 items 2025-12-04T16:49:50.5054501Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T16:49:50.5055763Z Running 5 items in this shard: test/distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_replicate_tp_device_mesh, test/distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_replicate_transformer, test/distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_replicate_transformer_managed_modules, test/distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_parity_2d_mlp, test/distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp 2025-12-04T16:49:50.5056890Z 2025-12-04T16:49:50.5057317Z distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_replicate_tp_device_mesh I1204 16:27:50.241000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 220263 2025-12-04T16:49:50.5058070Z I1204 16:27:50.242000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 220264 2025-12-04T16:49:50.5058568Z I1204 16:27:50.242000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 220265 2025-12-04T16:49:50.5059050Z I1204 16:27:50.243000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 220266 2025-12-04T16:49:50.5059388Z PASSED [2.5932s] [ 20%] 2025-12-04T16:49:50.5059895Z distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_replicate_transformer I1204 16:27:52.654000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 220571 2025-12-04T16:49:50.5060568Z I1204 16:27:52.655000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 220572 2025-12-04T16:49:50.5061052Z I1204 16:27:52.655000 220195 
site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 220573 2025-12-04T16:49:50.5062027Z I1204 16:27:52.656000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 220574 2025-12-04T16:49:50.5062716Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning. 2025-12-04T16:49:50.5063385Z return func(*args, **kwargs) 2025-12-04T16:49:50.5063930Z [rank0]:[W1204 16:27:54.365872614 ProcessGroupNCCL.cpp:5138] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group() 2025-12-04T16:49:50.5064468Z PASSED [128.3777s] [ 40%] 2025-12-04T16:49:50.5064973Z distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_replicate_transformer_managed_modules I1204 16:30:01.033000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 220888 2025-12-04T16:49:50.5065533Z I1204 16:30:01.033000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 220889 2025-12-04T16:49:50.5065911Z I1204 16:30:01.034000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 220890 2025-12-04T16:49:50.5066298Z I1204 16:30:01.034000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 220891 2025-12-04T16:49:50.5066563Z PASSED [2.5073s] [ 60%] 2025-12-04T16:49:50.5067029Z distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_parity_2d_mlp I1204 16:30:03.541000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 221172 2025-12-04T16:49:50.5067586Z I1204 16:30:03.542000 220195 
site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 221173 2025-12-04T16:49:50.5067966Z I1204 16:30:03.542000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 221174 2025-12-04T16:49:50.5068351Z I1204 16:30:03.543000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 221175 2025-12-04T16:49:50.5068886Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning. 2025-12-04T16:49:50.5069310Z return func(*args, **kwargs) 2025-12-04T16:49:50.5069675Z [rank0]:[W1204 16:34:24.319781231 ProcessGroupNCCL.cpp:5138] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group() 2025-12-04T16:49:50.5070043Z PASSED [278.7785s] [ 80%] 2025-12-04T16:49:50.5070434Z distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp I1204 16:34:42.322000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 221585 2025-12-04T16:49:50.5070964Z I1204 16:34:42.322000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 221586 2025-12-04T16:49:50.5071345Z I1204 16:34:42.323000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 221587 2025-12-04T16:49:50.5071732Z I1204 16:34:42.323000 220195 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 221588 2025-12-04T16:49:50.5072154Z [rank0]:I1204 16:39:42.420000 221585 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 0 2025-12-04T16:49:50.5072599Z [rank2]:I1204 16:39:42.420000 221587 
site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 2 2025-12-04T16:49:50.5073038Z [rank1]:I1204 16:39:42.420000 221586 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 1 2025-12-04T16:49:50.5073732Z [rank3]:I1204 16:39:42.420000 221588 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 3 2025-12-04T16:49:50.5074141Z [rank2]:I1204 16:39:42.421000 221587 site-packages/torch/testing/_internal/common_distributed.py:891] Process 2 sent traceback 2025-12-04T16:49:50.5074529Z [rank0]:I1204 16:39:42.421000 221585 site-packages/torch/testing/_internal/common_distributed.py:891] Process 0 sent traceback 2025-12-04T16:49:50.5074911Z [rank1]:I1204 16:39:42.421000 221586 site-packages/torch/testing/_internal/common_distributed.py:891] Process 1 sent traceback 2025-12-04T16:49:50.5075247Z [rank3]:I1204 16:39:42.421000 221588 site-packages/torch/testing/_internal/common_distributed.py:891] Process 3 sent traceback 2025-12-04T16:49:50.5075596Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Process 0 timed out with traceback: 2025-12-04T16:49:50.5075913Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5076237Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007714c29fe640 (most recent call first): 2025-12-04T16:49:50.5076588Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5076872Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5077195Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007714f0dfe640 (most recent call first): 
2025-12-04T16:49:50.5077582Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5077860Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5078180Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007714c33ff640 (most recent call first): 2025-12-04T16:49:50.5078521Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5078797Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5079116Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007714f17ff640 (most recent call first): 2025-12-04T16:49:50.5079461Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5079741Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5080075Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x0000772d3a4fe640 (most recent call first): 2025-12-04T16:49:50.5080599Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5081139Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5081618Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5082097Z E1204 16:39:42.421000 220195 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5082458Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5082801Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000077372daac740 (most recent call first): 2025-12-04T16:49:50.5083294Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5083905Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5084516Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5085124Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5085713Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5086299Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5086892Z E1204 16:39:42.421000 220195 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5087521Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5088060Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5088558Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5089055Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5089547Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5089982Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5090292Z E1204 16:39:42.421000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5090600Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Process 1 timed out with traceback: 2025-12-04T16:49:50.5090906Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5091229Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000074d649dfe640 
(most recent call first): 2025-12-04T16:49:50.5091577Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5091858Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5092195Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000074d64a7ff640 (most recent call first): 2025-12-04T16:49:50.5092539Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5092818Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5093153Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000074d677ffe640 (most recent call first): 2025-12-04T16:49:50.5093548Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5093828Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5094146Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000074d6789ff640 (most recent call first): 2025-12-04T16:49:50.5094486Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5094767Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5095095Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x000074eec15fe640 (most recent call first): 2025-12-04T16:49:50.5095613Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 
2025-12-04T16:49:50.5096145Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5096612Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5097086Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5097448Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5097826Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000074f8b4e38740 (most recent call first): 2025-12-04T16:49:50.5098326Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5098903Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5099491Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5100097Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5100690Z E1204 16:39:42.422000 220195 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5101271Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5101880Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5102472Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5103052Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5103549Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5104046Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5104541Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5104971Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5105279Z E1204 16:39:42.422000 220195 
site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5105587Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Process 2 timed out with traceback: 2025-12-04T16:49:50.5105893Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5106214Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007db21a9fe640 (most recent call first): 2025-12-04T16:49:50.5106563Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5106862Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5107181Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007db248dfe640 (most recent call first): 2025-12-04T16:49:50.5107566Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5107844Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5108159Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007db21b3ff640 (most recent call first): 2025-12-04T16:49:50.5108499Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5108775Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5109090Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007db2497ff640 (most recent call first): 2025-12-04T16:49:50.5109429Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5109704Z E1204 16:39:42.422000 220195 
site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5110034Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007dca925fe640 (most recent call first): 2025-12-04T16:49:50.5110560Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5111095Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5111556Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5112075Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5112430Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5112746Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007dd48680e740 (most recent call first): 2025-12-04T16:49:50.5113230Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5113796Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5114381Z E1204 16:39:42.422000 220195 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5114982Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5115561Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5116139Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5116731Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5117317Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5117896Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5118389Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5118881Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5119369Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5119793Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5120094Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5120411Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Process 3 timed out with traceback: 2025-12-04T16:49:50.5120715Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5121051Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000780b1e7ff640 (most recent call first): 2025-12-04T16:49:50.5121443Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5121719Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5122035Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000780b15bff640 (most recent call first): 2025-12-04T16:49:50.5122374Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5122650Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5122962Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000780b4c1fe640 (most recent call first): 2025-12-04T16:49:50.5123301Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 
2025-12-04T16:49:50.5123577Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5123894Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000780b4cbff640 (most recent call first): 2025-12-04T16:49:50.5124233Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5124508Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5124839Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007823959fe640 (most recent call first): 2025-12-04T16:49:50.5125348Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5125879Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5126339Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5126807Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5127163Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5127516Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000782d88f6e740 (most recent call first): 2025-12-04T16:49:50.5128011Z E1204 16:39:42.422000 220195 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5128582Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5129184Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5129784Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5130399Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5130991Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5131583Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5132168Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5132699Z E1204 16:39:42.422000 220195 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5133191Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5133682Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5134168Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5134592Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5134894Z E1204 16:39:42.422000 220195 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5135087Z FAILED [300.1077s] [100%] 2025-12-04T16:49:50.5135157Z 2025-12-04T16:49:50.5135218Z =================================== FAILURES =================================== 2025-12-04T16:49:50.5135402Z ___________________ ReplicateTest.test_train_replicate_fsdp ____________________ 2025-12-04T16:49:50.5135572Z Traceback (most recent call last): 2025-12-04T16:49:50.5135822Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T16:49:50.5136070Z self._join_processes(fn) 2025-12-04T16:49:50.5136320Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T16:49:50.5136588Z self._check_return_codes(fn, elapsed_time) 2025-12-04T16:49:50.5136860Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1084, in _check_return_codes 2025-12-04T16:49:50.5137120Z raise RuntimeError( 2025-12-04T16:49:50.5137283Z RuntimeError: Process 0 terminated or timed out after 300.09897661209106 seconds 2025-12-04T16:49:50.5137536Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T16:49:50.5137723Z Timing out after 300 seconds and killing subprocesses. 2025-12-04T16:49:50.5138109Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed._composable.test_replicate_with_fsdp/distributed._composable.test_replicate_with_fsdp-bc194f5ff6fd8eb4.xml - 2025-12-04T16:49:50.5138505Z =========================== short test summary info ============================ 2025-12-04T16:49:50.5138858Z FAILED [300.1077s] distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp - RuntimeError: Process 0 terminated or timed out after 300.09897661209106 seconds 2025-12-04T16:49:50.5139220Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T16:49:50.5139407Z =================== 1 failed, 4 passed in 712.38s (0:11:52) ==================== 2025-12-04T16:49:50.5139568Z Got exit code 1 2025-12-04T16:49:50.5139671Z Retrying single test... 
2025-12-04T16:49:50.5139975Z Test results will be stored in test-reports/python-pytest/distributed._composable.test_replicate_with_fsdp/distributed._composable.test_replicate_with_fsdp-0c954ec2d78f171d.xml 2025-12-04T16:49:50.5140305Z ============================= test session starts ============================== 2025-12-04T16:49:50.5140521Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T16:49:50.5140714Z cachedir: .pytest_cache 2025-12-04T16:49:50.5140943Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T16:49:50.5141184Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T16:49:50.5141308Z configfile: pytest.ini 2025-12-04T16:49:50.5141536Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T16:49:50.5141810Z collecting ... collected 5 items / 4 deselected / 1 selected 2025-12-04T16:49:50.5142120Z stepcurrent: skipping 4 already run items. 
Running only test/distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp 2025-12-04T16:49:50.5142402Z Running 1 items in this shard 2025-12-04T16:49:50.5142478Z 2025-12-04T16:49:50.5142769Z distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp I1204 16:39:45.084000 221891 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 221959 2025-12-04T16:49:50.5143247Z I1204 16:39:45.085000 221891 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 221960 2025-12-04T16:49:50.5143593Z I1204 16:39:45.085000 221891 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 221961 2025-12-04T16:49:50.5143939Z I1204 16:39:45.086000 221891 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 221962 2025-12-04T16:49:50.5144310Z [rank0]:I1204 16:44:45.111000 221959 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 0 2025-12-04T16:49:50.5144705Z [rank2]:I1204 16:44:45.111000 221961 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 2 2025-12-04T16:49:50.5145099Z [rank1]:I1204 16:44:45.111000 221960 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 1 2025-12-04T16:49:50.5145491Z [rank3]:I1204 16:44:45.111000 221962 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 3 2025-12-04T16:49:50.5145857Z [rank0]:I1204 16:44:45.112000 221959 site-packages/torch/testing/_internal/common_distributed.py:891] Process 0 sent traceback 2025-12-04T16:49:50.5146203Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Process 0 timed out with traceback: 2025-12-04T16:49:50.5146507Z E1204 
16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5146824Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079760effe640 (most recent call first): 2025-12-04T16:49:50.5147182Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5147464Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5147810Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007975e0dfe640 (most recent call first): 2025-12-04T16:49:50.5148176Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5148485Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5148802Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079760f9ff640 (most recent call first): 2025-12-04T16:49:50.5149143Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5149420Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5149735Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007975e17ff640 (most recent call first): 2025-12-04T16:49:50.5150075Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5150350Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5150681Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x0000798e53fff640 (most recent call first): 2025-12-04T16:49:50.5151192Z E1204 
16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5151722Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5152185Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5152660Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5153017Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5153335Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000079984b5ae740 (most recent call first): 2025-12-04T16:49:50.5153818Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5154384Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5154971Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5155578Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5156160Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5156755Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5157344Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5158028Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5158563Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5159054Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5159546Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5160034Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 
2025-12-04T16:49:50.5160460Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5160762Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5161053Z [rank1]:I1204 16:44:45.112000 221960 site-packages/torch/testing/_internal/common_distributed.py:891] Process 1 sent traceback 2025-12-04T16:49:50.5161386Z [rank2]:I1204 16:44:45.112000 221961 site-packages/torch/testing/_internal/common_distributed.py:891] Process 2 sent traceback 2025-12-04T16:49:50.5161727Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Process 1 timed out with traceback: 2025-12-04T16:49:50.5162030Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5162347Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a92f67ff640 (most recent call first): 2025-12-04T16:49:50.5162689Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5162963Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5163280Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a92f5dfe640 (most recent call first): 2025-12-04T16:49:50.5163620Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5163896Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5164213Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a9323dfe640 (most recent call first): 2025-12-04T16:49:50.5164552Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 
2025-12-04T16:49:50.5164826Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5165139Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007a93247ff640 (most recent call first): 2025-12-04T16:49:50.5165501Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5165778Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5166106Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007aab6d4fe640 (most recent call first): 2025-12-04T16:49:50.5166640Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5167194Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5167711Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5168181Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5168536Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5168853Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007ab561635740 (most recent call first): 2025-12-04T16:49:50.5169338Z E1204 16:44:45.112000 221891 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5169905Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5170494Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5171094Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5171684Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5172266Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5172865Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5173453Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5173988Z E1204 16:44:45.112000 221891 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5174508Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5175016Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5175509Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5175953Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5176295Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5176592Z [rank3]:I1204 16:44:45.112000 221962 site-packages/torch/testing/_internal/common_distributed.py:891] Process 3 sent traceback 2025-12-04T16:49:50.5176937Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Process 2 timed out with traceback: 2025-12-04T16:49:50.5177243Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5177605Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000709568dfe640 (most recent call first): 2025-12-04T16:49:50.5177950Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5178230Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5178553Z E1204 16:44:45.112000 221891 
site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000709596dfe640 (most recent call first): 2025-12-04T16:49:50.5178896Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5179175Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5179494Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007095977ff640 (most recent call first): 2025-12-04T16:49:50.5179836Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5180115Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5180436Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007095697ff640 (most recent call first): 2025-12-04T16:49:50.5180779Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5181060Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5181393Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007095db9ff640 (most recent call first): 2025-12-04T16:49:50.5181906Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5182442Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5182911Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5183385Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5183743Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5184086Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000070b7d3d26740 (most recent call first): 2025-12-04T16:49:50.5184572Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5185160Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5185777Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5186380Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5186962Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5187593Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5188191Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5188779Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5189316Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5189813Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5190312Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5190803Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5191232Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5191539Z E1204 16:44:45.112000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5191845Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Process 3 timed out with traceback: 2025-12-04T16:49:50.5192151Z E1204 16:44:45.113000 221891 
site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5192476Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e0e129fe640 (most recent call first): 2025-12-04T16:49:50.5192821Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5193100Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5193419Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e0e40bfe640 (most recent call first): 2025-12-04T16:49:50.5193784Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5194063Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5194396Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e0e415ff640 (most recent call first): 2025-12-04T16:49:50.5194774Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5195058Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5195375Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e0e133ff640 (most recent call first): 2025-12-04T16:49:50.5195719Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5195995Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5196328Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007e268a1fe640 (most recent call first): 2025-12-04T16:49:50.5196842Z E1204 16:44:45.113000 221891 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5197376Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5197927Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5198400Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5198761Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5199083Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e307e24a740 (most recent call first): 2025-12-04T16:49:50.5199575Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5200148Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5200737Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5201342Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5201930Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5202512Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5203116Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5203704Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5204276Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5204768Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5205266Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5205754Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 
2025-12-04T16:49:50.5206184Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5206491Z E1204 16:44:45.113000 221891 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5206684Z FAILED [300.2093s] [100%] 2025-12-04T16:49:50.5206758Z 2025-12-04T16:49:50.5206818Z =================================== FAILURES =================================== 2025-12-04T16:49:50.5207000Z ___________________ ReplicateTest.test_train_replicate_fsdp ____________________ 2025-12-04T16:49:50.5207168Z Traceback (most recent call last): 2025-12-04T16:49:50.5207415Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T16:49:50.5207694Z self._join_processes(fn) 2025-12-04T16:49:50.5207943Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T16:49:50.5208209Z self._check_return_codes(fn, elapsed_time) 2025-12-04T16:49:50.5208477Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1084, in _check_return_codes 2025-12-04T16:49:50.5208737Z raise RuntimeError( 2025-12-04T16:49:50.5208898Z RuntimeError: Process 0 terminated or timed out after 300.02735352516174 seconds 2025-12-04T16:49:50.5209105Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T16:49:50.5209284Z Timing out after 300 seconds and killing subprocesses. 
2025-12-04T16:49:50.5209667Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed._composable.test_replicate_with_fsdp/distributed._composable.test_replicate_with_fsdp-0c954ec2d78f171d.xml - 2025-12-04T16:49:50.5210043Z =========================== short test summary info ============================ 2025-12-04T16:49:50.5210390Z FAILED [300.2093s] distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp - RuntimeError: Process 0 terminated or timed out after 300.02735352516174 seconds 2025-12-04T16:49:50.5210735Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T16:49:50.5210907Z ================= 1 failed, 4 deselected in 300.22s (0:05:00) ================== 2025-12-04T16:49:50.5211054Z Got exit code 1 2025-12-04T16:49:50.5211155Z Retrying single test... 2025-12-04T16:49:50.5211457Z Test results will be stored in test-reports/python-pytest/distributed._composable.test_replicate_with_fsdp/distributed._composable.test_replicate_with_fsdp-60a7d8ad4b7e8060.xml 2025-12-04T16:49:50.5211803Z ============================= test session starts ============================== 2025-12-04T16:49:50.5212018Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T16:49:50.5212211Z cachedir: .pytest_cache 2025-12-04T16:49:50.5212436Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T16:49:50.5212696Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T16:49:50.5212836Z configfile: pytest.ini 2025-12-04T16:49:50.5213453Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T16:49:50.5213728Z collecting ... collected 5 items / 4 deselected / 1 selected 2025-12-04T16:49:50.5214042Z stepcurrent: skipping 4 already run items. 
Running only test/distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp 2025-12-04T16:49:50.5214325Z Running 1 items in this shard 2025-12-04T16:49:50.5214401Z 2025-12-04T16:49:50.5214692Z distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp I1204 16:44:47.534000 222265 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 222333 2025-12-04T16:49:50.5215170Z I1204 16:44:47.535000 222265 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 222334 2025-12-04T16:49:50.5215518Z I1204 16:44:47.535000 222265 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 222335 2025-12-04T16:49:50.5215860Z I1204 16:44:47.536000 222265 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 222336 2025-12-04T16:49:50.5216228Z [rank0]:I1204 16:49:47.590000 222333 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 0 2025-12-04T16:49:50.5216628Z [rank2]:I1204 16:49:47.590000 222335 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 2 2025-12-04T16:49:50.5217021Z [rank1]:I1204 16:49:47.590000 222334 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 1 2025-12-04T16:49:50.5217383Z [rank0]:I1204 16:49:47.591000 222333 site-packages/torch/testing/_internal/common_distributed.py:891] Process 0 sent traceback 2025-12-04T16:49:50.5217791Z [rank3]:I1204 16:49:47.590000 222336 site-packages/torch/testing/_internal/common_distributed.py:880] Received event Event.GET_TRACEBACK on process 3 2025-12-04T16:49:50.5218153Z [rank2]:I1204 16:49:47.591000 222335 site-packages/torch/testing/_internal/common_distributed.py:891] Process 2 sent traceback 2025-12-04T16:49:50.5218495Z E1204 
16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Process 0 timed out with traceback: 2025-12-04T16:49:50.5218797Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5219115Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e0439bff640 (most recent call first): 2025-12-04T16:49:50.5219458Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5219736Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5220059Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e04427ff640 (most recent call first): 2025-12-04T16:49:50.5220399Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5220673Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5221003Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e046fdfe640 (most recent call first): 2025-12-04T16:49:50.5221344Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5221617Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5221948Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e04707ff640 (most recent call first): 2025-12-04T16:49:50.5222320Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5222594Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5222922Z E1204 16:49:47.591000 222265 
site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007e1cb93fe640 (most recent call first): 2025-12-04T16:49:50.5223435Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5223964Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5224432Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5224902Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5225258Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5225573Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007e26ac657740 (most recent call first): 2025-12-04T16:49:50.5226054Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5226623Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5227210Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 
242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5227852Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5228433Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5229019Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5229610Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5230193Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5230745Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5231239Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5231775Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5232262Z E1204 16:49:47.591000 222265 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5232684Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5232988Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5233290Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Process 1 timed out with traceback: 2025-12-04T16:49:50.5233591Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5233910Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c1e5edfe640 (most recent call first): 2025-12-04T16:49:50.5234251Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5234525Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5234839Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c1e317ff640 (most recent call first): 2025-12-04T16:49:50.5235179Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5235452Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5235767Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c1e30dfe640 (most recent call first): 2025-12-04T16:49:50.5236108Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5236391Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5236711Z 
E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c1e5f7ff640 (most recent call first): 2025-12-04T16:49:50.5237061Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5237339Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5237709Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x00007c1ea39ff640 (most recent call first): 2025-12-04T16:49:50.5238222Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5238757Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5239224Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5239715Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5240076Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5240413Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x00007c409bf50740 (most recent call first): 2025-12-04T16:49:50.5240933Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 
2025-12-04T16:49:50.5241507Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5242095Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5242708Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5243296Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5243881Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5244473Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5245062Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5245601Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5246097Z E1204 16:49:47.591000 
222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5246591Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5247082Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5247553Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5247863Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5248159Z [rank1]:I1204 16:49:47.591000 222334 site-packages/torch/testing/_internal/common_distributed.py:891] Process 1 sent traceback 2025-12-04T16:49:50.5248505Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Process 2 timed out with traceback: 2025-12-04T16:49:50.5248807Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5249149Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000728cdb9fe640 (most recent call first): 2025-12-04T16:49:50.5249493Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5249790Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5250141Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000728cdc3ff640 (most recent call first): 2025-12-04T16:49:50.5250484Z E1204 16:49:47.591000 222265 
site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5250760Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5251080Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000728d09dfe640 (most recent call first): 2025-12-04T16:49:50.5251424Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5251701Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5252018Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000728d0a7ff640 (most recent call first): 2025-12-04T16:49:50.5252360Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5252640Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5252970Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x000072a54f5fe640 (most recent call first): 2025-12-04T16:49:50.5253480Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 2025-12-04T16:49:50.5254014Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5254482Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5254957Z E1204 16:49:47.591000 222265 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5255318Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5255638Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000072af471f0740 (most recent call first): 2025-12-04T16:49:50.5256123Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5256695Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5257280Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5257911Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5258520Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5259101Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5259735Z E1204 16:49:47.591000 222265 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5260324Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5260857Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5261351Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5261848Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5262339Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5262769Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5263076Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5263380Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Process 3 timed out with traceback: 2025-12-04T16:49:50.5263684Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5264005Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000075de9b3ff640 
(most recent call first): 2025-12-04T16:49:50.5264349Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5264626Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5264949Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000075de9a9fe640 (most recent call first): 2025-12-04T16:49:50.5265294Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5265572Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5265889Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000075dec8dfe640 (most recent call first): 2025-12-04T16:49:50.5266235Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5266513Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5266831Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x000075dec97ff640 (most recent call first): 2025-12-04T16:49:50.5267172Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5267461Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5267834Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Current thread 0x000075f7128fe640 (most recent call first): 2025-12-04T16:49:50.5268379Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 885 in _event_listener 
2025-12-04T16:49:50.5268924Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 953 in run 2025-12-04T16:49:50.5269390Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 1016 in _bootstrap_inner 2025-12-04T16:49:50.5269864Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/threading.py", line 973 in _bootstrap 2025-12-04T16:49:50.5270222Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5270540Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] Thread 0x0000760105c51740 (most recent call first): 2025-12-04T16:49:50.5271028Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3004 in all_reduce 2025-12-04T16:49:50.5271601Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper 2025-12-04T16:49:50.5272191Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/var/lib/jenkins/pytorch/test/distributed/_composable/test_replicate_with_fsdp.py", line 242 in test_train_replicate_fsdp 2025-12-04T16:49:50.5272800Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 227 in wrapper 2025-12-04T16:49:50.5273383Z E1204 16:49:47.591000 222265 
site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329 in wrapper 2025-12-04T16:49:50.5273966Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 772 in wrapper 2025-12-04T16:49:50.5274560Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 925 in run_test 2025-12-04T16:49:50.5275149Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 903 in _run 2025-12-04T16:49:50.5275684Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 108 in run 2025-12-04T16:49:50.5276179Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap 2025-12-04T16:49:50.5276685Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 129 in _main 2025-12-04T16:49:50.5277175Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main 2025-12-04T16:49:50.5277634Z E1204 16:49:47.591000 222265 site-packages/torch/testing/_internal/common_distributed.py:988] File "", line 1 in 2025-12-04T16:49:50.5277969Z E1204 16:49:47.591000 222265 
site-packages/torch/testing/_internal/common_distributed.py:988] 2025-12-04T16:49:50.5278282Z [rank3]:I1204 16:49:47.591000 222336 site-packages/torch/testing/_internal/common_distributed.py:891] Process 3 sent traceback 2025-12-04T16:49:50.5278518Z FAILED [300.2529s] [100%] 2025-12-04T16:49:50.5278593Z 2025-12-04T16:49:50.5278651Z =================================== FAILURES =================================== 2025-12-04T16:49:50.5278837Z ___________________ ReplicateTest.test_train_replicate_fsdp ____________________ 2025-12-04T16:49:50.5279009Z Traceback (most recent call last): 2025-12-04T16:49:50.5279261Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 770, in wrapper 2025-12-04T16:49:50.5279510Z self._join_processes(fn) 2025-12-04T16:49:50.5279760Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1039, in _join_processes 2025-12-04T16:49:50.5280037Z self._check_return_codes(fn, elapsed_time) 2025-12-04T16:49:50.5280313Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1084, in _check_return_codes 2025-12-04T16:49:50.5280577Z raise RuntimeError( 2025-12-04T16:49:50.5280742Z RuntimeError: Process 0 terminated or timed out after 300.0555052757263 seconds 2025-12-04T16:49:50.5280952Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T16:49:50.5281135Z Timing out after 300 seconds and killing subprocesses. 
2025-12-04T16:49:50.5281523Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed._composable.test_replicate_with_fsdp/distributed._composable.test_replicate_with_fsdp-60a7d8ad4b7e8060.xml - 2025-12-04T16:49:50.5281902Z =========================== short test summary info ============================ 2025-12-04T16:49:50.5282254Z FAILED [300.2529s] distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp - RuntimeError: Process 0 terminated or timed out after 300.0555052757263 seconds 2025-12-04T16:49:50.5282600Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-12-04T16:49:50.5282775Z ================= 1 failed, 4 deselected in 300.26s (0:05:00) ================== 2025-12-04T16:49:50.5282928Z Got exit code 1 2025-12-04T16:49:50.5283150Z FAILED CONSISTENTLY: test/distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp 2025-12-04T16:49:50.5283476Z Test failed consistently, continuing with the rest of the tests due to continue-through-error being set 2025-12-04T16:49:50.5283877Z Test results will be stored in test-reports/python-pytest/distributed._composable.test_replicate_with_fsdp/distributed._composable.test_replicate_with_fsdp-b3d79708301e8b6b.xml 2025-12-04T16:49:50.5284207Z ============================= test session starts ============================== 2025-12-04T16:49:50.5284421Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T16:49:50.5284617Z cachedir: .pytest_cache 2025-12-04T16:49:50.5284846Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T16:49:50.5285089Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T16:49:50.5285215Z configfile: pytest.ini 2025-12-04T16:49:50.5285448Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, 
subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T16:49:50.5285743Z collecting ... collected 5 items / 5 deselected / 0 selected 2025-12-04T16:49:50.5285911Z stepcurrent: skipping 5 already run items. 2025-12-04T16:49:50.5286047Z Running 0 items in this shard 2025-12-04T16:49:50.5286122Z 2025-12-04T16:49:50.5286405Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed._composable.test_replicate_with_fsdp/distributed._composable.test_replicate_with_fsdp-b3d79708301e8b6b.xml - 2025-12-04T16:49:50.5286822Z ============================ 5 deselected in 0.00s ============================= 2025-12-04T16:49:50.5287114Z The following tests failed consistently: ['test/distributed/_composable/test_replicate_with_fsdp.py::ReplicateTest::test_train_replicate_fsdp'] 2025-12-04T16:49:50.5287335Z 2025-12-04T16:49:50.5287621Z FINISHED PRINTING LOG FILE of distributed/_composable/test_replicate_with_fsdp 1/1 (test/test-reports/distributed._composable.test_replicate_with_fsdp_1.1_f5ca0749d20b4f7e_.log) 2025-12-04T16:49:50.5287890Z 2025-12-04T16:49:50.5288040Z Finished distributed/_composable/test_replicate_with_fsdp 1/1 ... [2025-12-04 16:49:50.504835][2272745.44107906], took 22.03min 2025-12-04T16:49:50.5288491Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:49:50.5288892Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:49:50.5289120Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T16:49:50.5289307Z Uploading artifacts took 0.00 seconds 2025-12-04T16:49:50.5289471Z distributed/_composable/test_replicate_with_fsdp 1/1 failed! 2025-12-04T16:49:50.5289706Z Running distributed/tensor/test_xla_integration 1/1 ... 
[2025-12-04 16:49:50.507694][2272745.443944263] 2025-12-04T16:49:50.5289917Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:49:50.5290334Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/test_xla_integration.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 16:49:50.507922] 2025-12-04T16:49:52.7264829Z 2025-12-04T16:49:52.7266154Z distributed/tensor/test_xla_integration 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.tensor.test_xla_integration_1.1_81f954cdae1f6c7f_.log 2025-12-04T16:49:52.7268179Z Running 3 items in this shard: test/distributed/tensor/test_xla_integration.py::DTensorXLAIntegrationTest::test_xla_distribute_tensor_1d_replicate, test/distributed/tensor/test_xla_integration.py::DTensorXLAIntegrationTest::test_xla_distribute_tensor_1d_shard, test/distributed/tensor/test_xla_integration.py::DTensorXLAIntegrationTest::test_xla_distribute_tensor_2d 2025-12-04T16:49:52.7277692Z 2025-12-04T16:49:52.7277954Z Finished distributed/tensor/test_xla_integration 1/1 ... [2025-12-04 16:49:52.726221][2272747.662466421], took 0.04min 2025-12-04T16:49:52.7278738Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:49:52.7292784Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:49:52.7294201Z Running distributed/checkpoint/_experimental/test_types 1/1 ... 
[2025-12-04 16:49:52.729249][2272747.665498324] 2025-12-04T16:49:52.7294521Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:49:52.7295927Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/_experimental/test_types.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 16:49:52.729474] 2025-12-04T16:49:55.1985578Z 2025-12-04T16:49:55.1987215Z distributed/checkpoint/_experimental/test_types 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint._experimental.test_types_1.1_2a01e8d876288075_.log 2025-12-04T16:49:55.1988700Z Running 3 items in this shard: test/distributed/checkpoint/_experimental/test_types.py::TestRankInfo::test_rank_info_default_initialization, test/distributed/checkpoint/_experimental/test_types.py::TestRankInfo::test_rank_info_initialization, test/distributed/checkpoint/_experimental/test_types.py::TestRankInfo::test_state_dict_type_alias 2025-12-04T16:49:55.1989629Z 2025-12-04T16:49:55.1989934Z Finished distributed/checkpoint/_experimental/test_types 1/1 ... [2025-12-04 16:49:55.198257][2272750.134501701], took 0.04min 2025-12-04T16:49:55.1998647Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:49:55.2010775Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:49:55.2013309Z Running distributed/tensor/experimental/test_register_sharding 1/1 ... 
[2025-12-04 16:49:55.201242][2272750.137492273] 2025-12-04T16:49:55.2013632Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:49:55.2015552Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/experimental/test_register_sharding.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 16:49:55.201456] 2025-12-04T16:58:42.0968331Z 2025-12-04T16:58:42.0969609Z distributed/tensor/experimental/test_register_sharding 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.tensor.experimental.test_register_sharding_1.1_a2cecfd057c235b5_.log 2025-12-04T16:58:42.0972058Z Running 3 items in this shard: test/distributed/tensor/experimental/test_register_sharding.py::TestRegisterSharding::test_argmax, test/distributed/tensor/experimental/test_register_sharding.py::TestRegisterSharding::test_register_sharding_for_tensor_kwargs, test/distributed/tensor/experimental/test_register_sharding.py::TestRegisterSharding::test_softmax_fwd 2025-12-04T16:58:42.0973429Z 2025-12-04T16:58:42.0973751Z Finished distributed/tensor/experimental/test_register_sharding 1/1 ... [2025-12-04 16:58:42.096536][2273277.03278382], took 8.78min 2025-12-04T16:58:42.0977957Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:58:42.0987186Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:58:42.0989005Z Running distributed/test_backends 1/1 ... 
[2025-12-04 16:58:42.098806][2273277.03505755] 2025-12-04T16:58:42.0989314Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:58:42.0991124Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/test_backends.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 16:58:42.098999] 2025-12-04T16:58:44.4681568Z 2025-12-04T16:58:44.4682449Z distributed/test_backends 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.test_backends_1.1_a351f86089a5c079_.log 2025-12-04T16:58:44.4683602Z Running 2 items in this shard: test/distributed/test_backends.py::TestMiscCollectiveUtilsCUDA::test_create_pg_cuda, test/distributed/test_backends.py::TestMiscCollectiveUtilsCUDA::test_device_to_backend_mapping_cuda 2025-12-04T16:58:44.4684235Z 2025-12-04T16:58:44.4684630Z Finished distributed/test_backends 1/1 ... [2025-12-04 16:58:44.467874][2273279.404121591], took 0.04min 2025-12-04T16:58:44.4691813Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T16:58:44.4701887Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T16:58:44.4703313Z Running distributed/tensor/test_experimental_ops 1/1 ... [2025-12-04 16:58:44.470240][2273279.40649039] 2025-12-04T16:58:44.4703713Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T16:58:44.4705919Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/test_experimental_ops.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 16:58:44.470443] 2025-12-04T17:03:25.9296369Z 2025-12-04T17:03:25.9297120Z distributed/tensor/test_experimental_ops 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.tensor.test_experimental_ops_1.1_28eebea36804df45_.log 2025-12-04T17:03:25.9298409Z Running 6 items in this shard: test/distributed/tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli, test/distributed/tensor/test_experimental_ops.py::DistOtherOpsTest::test_nll, test/distributed/tensor/test_experimental_ops.py::DistOtherOpsTest::test_slice, test/distributed/tensor/test_experimental_ops.py::DistOtherOpsTestWithLocalTensor::test_bernoulli, test/distributed/tensor/test_experimental_ops.py::DistOtherOpsTestWithLocalTensor::test_nll, test/distributed/tensor/test_experimental_ops.py::DistOtherOpsTestWithLocalTensor::test_slice 2025-12-04T17:03:25.9299361Z 2025-12-04T17:03:25.9299513Z Finished distributed/tensor/test_experimental_ops 1/1 ... [2025-12-04 17:03:25.929322][2273560.865569963], took 4.69min 2025-12-04T17:03:25.9306527Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T17:03:25.9315998Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T17:03:25.9318442Z Running distributed/checkpoint/test_quantized_hf_storage 1/1 ... [2025-12-04 17:03:25.931709][2273560.867960767] 2025-12-04T17:03:25.9318680Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T17:03:25.9320462Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/test_quantized_hf_storage.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 17:03:25.931914] 2025-12-04T17:03:28.2006376Z 2025-12-04T17:03:28.2007948Z distributed/checkpoint/test_quantized_hf_storage 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint.test_quantized_hf_storage_1.1_966c622301539874_.log 2025-12-04T17:03:28.2009916Z Running 2 items in this shard: test/distributed/checkpoint/test_quantized_hf_storage.py::TestQuantizedHfStorage::test_dequantization, test/distributed/checkpoint/test_quantized_hf_storage.py::TestQuantizedHfStorage::test_dtensor_slice_dequantization_block_alignment 2025-12-04T17:03:28.2011063Z 2025-12-04T17:03:28.2011508Z Finished distributed/checkpoint/test_quantized_hf_storage 1/1 ... [2025-12-04 17:03:28.200323][2273563.136569822], took 0.04min 2025-12-04T17:03:28.2020353Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T17:03:28.2030481Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T17:03:28.2032868Z Running distributed/_composable/test_composability/test_pp_composability 1/1 ... [2025-12-04 17:03:28.203153][2273563.1394041] 2025-12-04T17:03:28.2033376Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T17:03:28.2035565Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/_composable/test_composability/test_pp_composability.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 17:03:28.203353] 2025-12-04T17:03:30.2214937Z 2025-12-04T17:03:30.2215695Z distributed/_composable/test_composability/test_pp_composability 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed._composable.test_composability.test_pp_composability_1.1_c56e5590e0a1cf82_.log 2025-12-04T17:03:30.2224289Z Running 26 items in this shard: test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass0_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass0_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass1_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass1_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass2_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass2_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass3_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass3_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass4_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_3d_with_tp_dp_pp_ScheduleClass4_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_pp_and_dcp, 
test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass0_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass0_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass1_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass1_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass2_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass2_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass3_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass3_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass4_bfloat16, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_ScheduleClass4_float32, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_grads_ScheduleClass0, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_grads_ScheduleClass1, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_grads_ScheduleClass2, test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_grads_ScheduleClass3, 
test/distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_replicate_pp_grads_ScheduleClass4 2025-12-04T17:03:30.2231470Z 2025-12-04T17:03:30.2231775Z Finished distributed/_composable/test_composability/test_pp_composability 1/1 ... [2025-12-04 17:03:30.221111][2273565.157358988], took 0.03min 2025-12-04T17:03:30.2232391Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T17:03:30.2234739Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T17:03:30.2236622Z Running distributed/checkpoint/test_async_process_executor 1/1 ... [2025-12-04 17:03:30.223583][2273565.159833821] 2025-12-04T17:03:30.2236914Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T17:03:30.2238965Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/checkpoint/test_async_process_executor.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... 
[2025-12-04 17:03:30.223776] 2025-12-04T17:13:07.6887819Z 2025-12-04T17:13:07.6892535Z distributed/checkpoint/test_async_process_executor 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributed.checkpoint.test_async_process_executor_1.1_f3177b4a37093288_.log 2025-12-04T17:13:07.6895751Z Running 5 items in this shard: test/distributed/checkpoint/test_async_process_executor.py::TestAsyncProcessExecutor::test_checkpoint_save_failure_continues_serving, test/distributed/checkpoint/test_async_process_executor.py::TestAsyncProcessExecutorPrefixStore::test_checkpoint_save_with_prefix_store_enabled, test/distributed/checkpoint/test_async_process_executor.py::TestProcessGroupInitInfo::test_process_group_init_info_with_default_pg, test/distributed/checkpoint/test_async_process_executor.py::TestProcessGroupInitInfo::test_process_group_init_info_with_prefix_store_env_var, test/distributed/checkpoint/test_async_process_executor.py::TestProcessGroupInitInfo::test_process_group_init_info_without_prefix_store_env_var 2025-12-04T17:13:07.6898262Z 2025-12-04T17:13:07.6898627Z Finished distributed/checkpoint/test_async_process_executor 1/1 ... [2025-12-04 17:13:07.688583][2274142.624828274], took 9.62min 2025-12-04T17:13:07.6904640Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/distributed.test_dynamo_distributed/distributed.test_dynamo_distributed-2f8f6aba4ad822d0.xml 2025-12-04T17:13:07.6915822Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T17:13:07.6916166Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T17:13:07.6916448Z Uploading artifacts took 0.00 seconds 2025-12-04T17:13:07.6918215Z Running distributed/tensor/test_tensor_ops 1/4 ... 
[2025-12-04 17:13:07.691699][2274142.627949752] 2025-12-04T17:13:07.6918533Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T17:13:07.6920380Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributed/tensor/test_tensor_ops.py', '--shard-id=1', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=0', '--import-slow-tests', '--import-disabled-tests'] ... [2025-12-04 17:13:07.691919] 2025-12-04T17:17:58.5734418Z ##[error]The action 'Test' has timed out after 270 minutes. 2025-12-04T17:17:58.5787801Z ##[group]Run # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct 2025-12-04T17:17:58.5788126Z # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct 2025-12-04T17:17:58.5788518Z docker exec -t "d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test" 2025-12-04T17:17:58.5793687Z shell: /usr/bin/bash -e {0} 2025-12-04T17:17:58.5793804Z env: 2025-12-04T17:17:58.5793906Z GIT_DEFAULT_BRANCH: main 2025-12-04T17:17:58.5794050Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T17:17:58.5794236Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T17:17:58.5794405Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T17:17:58.5794937Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T17:17:58.5795527Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T17:17:58.5795712Z AWS_REGION: us-east-1 2025-12-04T17:17:58.5795903Z AWS_ACCESS_KEY_ID: *** 2025-12-04T17:17:58.5796062Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T17:17:58.5798204Z AWS_SESSION_TOKEN: *** 
2025-12-04T17:17:58.5798381Z CONTAINER_NAME: d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T17:17:58.5798572Z ##[endgroup] 2025-12-04T17:17:58.6788909Z ##[group]Run docker exec -t "d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3" sh -c "sudo chown -R 1001:1001 test" 2025-12-04T17:17:58.6789381Z docker exec -t "d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3" sh -c "sudo chown -R 1001:1001 test" 2025-12-04T17:17:58.6793021Z shell: /usr/bin/bash -e {0} 2025-12-04T17:17:58.6793152Z env: 2025-12-04T17:17:58.6793266Z GIT_DEFAULT_BRANCH: main 2025-12-04T17:17:58.6793433Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T17:17:58.6793651Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T17:17:58.6793854Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T17:17:58.6794469Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T17:17:58.6795061Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T17:17:58.6795203Z AWS_REGION: us-east-1 2025-12-04T17:17:58.6795366Z AWS_ACCESS_KEY_ID: *** 2025-12-04T17:17:58.6795551Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T17:17:58.6797961Z AWS_SESSION_TOKEN: *** 2025-12-04T17:17:58.6798140Z CONTAINER_NAME: d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T17:17:58.6798329Z ##[endgroup] 2025-12-04T17:17:58.7684568Z ##[group]Run cat test/**/*_toprint.log || true 2025-12-04T17:17:58.7684772Z cat test/**/*_toprint.log || true 2025-12-04T17:17:58.7689686Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T17:17:58.7689844Z env: 2025-12-04T17:17:58.7689942Z GIT_DEFAULT_BRANCH: main 2025-12-04T17:17:58.7690081Z RUNNER_ARTIFACT_DIR: 
/home/runner/_work/_temp/artifacts 2025-12-04T17:17:58.7690261Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T17:17:58.7690430Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T17:17:58.7690937Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T17:17:58.7691429Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T17:17:58.7691583Z AWS_REGION: us-east-1 2025-12-04T17:17:58.7691770Z AWS_ACCESS_KEY_ID: *** 2025-12-04T17:17:58.7691925Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T17:17:58.7694042Z AWS_SESSION_TOKEN: *** 2025-12-04T17:17:58.7694217Z CONTAINER_NAME: d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T17:17:58.7694403Z ##[endgroup] 2025-12-04T17:17:58.7746097Z Test results will be stored in test-reports/python-pytest/distributed.tensor.test_tensor_ops/distributed.tensor.test_tensor_ops-715e74ab9dcc4ee3.xml 2025-12-04T17:17:58.7746659Z ============================= test session starts ============================== 2025-12-04T17:17:58.7747007Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T17:17:58.7747201Z cachedir: .pytest_cache 2025-12-04T17:17:58.7747532Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T17:17:58.7747817Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T17:17:58.7750943Z configfile: pytest.ini 2025-12-04T17:17:58.7751172Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T17:17:58.7751502Z collecting ... 
collected 62 items 2025-12-04T17:17:58.7751650Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T17:17:58.7753118Z Running 11 items in this shard: test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_aten_contiguous, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_detach, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_index_put_scalar, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_op_out_variant, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_slice, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_unbind, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_zeros_like, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTestWithLocalTensor::test_clone, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTestWithLocalTensor::test_contiguous, test/distributed/tensor/test_tensor_ops.py::DistTensorOpsTestWithLocalTensor::test_new_full 2025-12-04T17:17:58.7754470Z 2025-12-04T17:17:58.7754737Z distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_aten_contiguous I1204 17:13:09.338000 227259 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 227327 2025-12-04T17:17:58.7755190Z I1204 17:13:09.338000 227259 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 227328 2025-12-04T17:17:58.7755537Z I1204 17:13:09.339000 227259 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 227329 2025-12-04T17:17:58.7756118Z I1204 17:13:09.340000 227259 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 227330 2025-12-04T17:17:58.7756352Z PASSED [148.5175s] [ 9%] 2025-12-04T17:17:58.7756670Z distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_detach I1204 17:15:37.668000 227259 
site-packages/torch/testing/_internal/common_distributed.py:849] Started process 0 with pid 227648 2025-12-04T17:17:58.7757113Z I1204 17:15:37.669000 227259 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 1 with pid 227649 2025-12-04T17:17:58.7757450Z I1204 17:15:37.669000 227259 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 2 with pid 227650 2025-12-04T17:17:58.7757847Z I1204 17:15:37.670000 227259 site-packages/torch/testing/_internal/common_distributed.py:849] Started process 3 with pid 227651 2025-12-04T17:17:58.7821768Z Prepare all required actions 2025-12-04T17:17:58.7822094Z Getting action download info 2025-12-04T17:17:59.1271111Z Download action repository 'seemethere/upload-artifact-s3@v5' (SHA:baba72d0712b404f646cebe0730933554ebce96a) 2025-12-04T17:17:59.9348569Z Download action repository 'actions/upload-artifact@v4' (SHA:ea165f8d65b6e75b540449e92b4886f43607fa02) 2025-12-04T17:18:00.8932852Z ##[group]Run ./.github/actions/upload-test-artifacts 2025-12-04T17:18:00.8933012Z with: 2025-12-04T17:18:00.8933105Z use-gha: true 2025-12-04T17:18:00.8933262Z file-suffix: test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223 2025-12-04T17:18:00.8933444Z s3-bucket: gha-artifacts 2025-12-04T17:18:00.8933550Z env: 2025-12-04T17:18:00.8933640Z GIT_DEFAULT_BRANCH: main 2025-12-04T17:18:00.8933772Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T17:18:00.8933947Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T17:18:00.8934135Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T17:18:00.8934648Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T17:18:00.8935219Z 
AWS_DEFAULT_REGION: us-east-1 2025-12-04T17:18:00.8935401Z AWS_REGION: us-east-1 2025-12-04T17:18:00.8935553Z AWS_ACCESS_KEY_ID: *** 2025-12-04T17:18:00.8935704Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T17:18:00.8937875Z AWS_SESSION_TOKEN: *** 2025-12-04T17:18:00.8938047Z CONTAINER_NAME: d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T17:18:00.8938230Z ##[endgroup] 2025-12-04T17:18:00.8971229Z ##[group]Run actions/upload-artifact@v4 2025-12-04T17:18:00.8971363Z with: 2025-12-04T17:18:00.8971561Z name: test-jsons-runattempt1-test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223.zip 2025-12-04T17:18:00.8971775Z retention-days: 14 2025-12-04T17:18:00.8971880Z if-no-files-found: warn 2025-12-04T17:18:00.8971986Z path: test/**/*.json 2025-12-04T17:18:00.8972087Z compression-level: 6 2025-12-04T17:18:00.8972186Z overwrite: false 2025-12-04T17:18:00.8972293Z include-hidden-files: false 2025-12-04T17:18:00.8972398Z env: 2025-12-04T17:18:00.8972488Z GIT_DEFAULT_BRANCH: main 2025-12-04T17:18:00.8972627Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T17:18:00.8972801Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T17:18:00.8972964Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T17:18:00.8973471Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T17:18:00.8973958Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T17:18:00.8974073Z AWS_REGION: us-east-1 2025-12-04T17:18:00.8974208Z AWS_ACCESS_KEY_ID: *** 2025-12-04T17:18:00.8974360Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T17:18:00.8976480Z AWS_SESSION_TOKEN: *** 2025-12-04T17:18:00.8976652Z CONTAINER_NAME: d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 
2025-12-04T17:18:00.8976836Z ##[endgroup] 2025-12-04T17:18:01.5514562Z With the provided path, there will be 6 files uploaded 2025-12-04T17:18:01.5517983Z Artifact name is valid! 2025-12-04T17:18:01.5518599Z Root directory input is valid! 2025-12-04T17:18:01.7906975Z Beginning upload of artifact content to blob storage 2025-12-04T17:18:02.2229765Z Uploaded bytes 44615 2025-12-04T17:18:02.3012491Z Finished uploading artifact content to blob storage! 2025-12-04T17:18:02.3015217Z SHA256 digest of uploaded artifact zip is 23119499e67effdefaebd7e39fbbffb4d950aa26e3b121df9343ab57d5ec7a39 2025-12-04T17:18:02.3018722Z Finalizing artifact upload 2025-12-04T17:18:02.4905592Z Artifact test-jsons-runattempt1-test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223.zip.zip successfully finalized. Artifact ID 4767394761 2025-12-04T17:18:02.4907047Z Artifact test-jsons-runattempt1-test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223.zip has been successfully uploaded! Final size is 44615 bytes. 
Artifact ID is 4767394761 2025-12-04T17:18:02.4910328Z Artifact download URL: https://github.com/pytorch/pytorch/actions/runs/19921726347/artifacts/4767394761 2025-12-04T17:18:02.5034076Z ##[group]Run actions/upload-artifact@v4 2025-12-04T17:18:02.5034243Z with: 2025-12-04T17:18:02.5034459Z name: test-reports-runattempt1-test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223.zip 2025-12-04T17:18:02.5034699Z retention-days: 14 2025-12-04T17:18:02.5034819Z if-no-files-found: ignore 2025-12-04T17:18:02.5034952Z path: test/**/*.xml test/**/*.csv 2025-12-04T17:18:02.5035084Z compression-level: 6 2025-12-04T17:18:02.5035213Z overwrite: false 2025-12-04T17:18:02.5035330Z include-hidden-files: false 2025-12-04T17:18:02.5035451Z env: 2025-12-04T17:18:02.5035552Z GIT_DEFAULT_BRANCH: main 2025-12-04T17:18:02.5035694Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T17:18:02.5035948Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T17:18:02.5036122Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T17:18:02.5036697Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T17:18:02.5037191Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T17:18:02.5037310Z AWS_REGION: us-east-1 2025-12-04T17:18:02.5037525Z AWS_ACCESS_KEY_ID: *** 2025-12-04T17:18:02.5037689Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T17:18:02.5039793Z AWS_SESSION_TOKEN: *** 2025-12-04T17:18:02.5039967Z CONTAINER_NAME: d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T17:18:02.5040153Z ##[endgroup] 2025-12-04T17:18:03.1182131Z With the provided path, there will be 60 files uploaded 2025-12-04T17:18:03.1185041Z Artifact name is valid! 
2025-12-04T17:18:03.1185776Z Root directory input is valid! 2025-12-04T17:18:03.3523145Z Beginning upload of artifact content to blob storage 2025-12-04T17:18:03.7756835Z Uploaded bytes 63921 2025-12-04T17:18:03.8475498Z Finished uploading artifact content to blob storage! 2025-12-04T17:18:03.8479829Z SHA256 digest of uploaded artifact zip is 7e7da1949660f6855f22ceba356bba1d241f4200db97fe74f7d20c84dd215d02 2025-12-04T17:18:03.8480291Z Finalizing artifact upload 2025-12-04T17:18:03.9865309Z Artifact test-reports-runattempt1-test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223.zip.zip successfully finalized. Artifact ID 4767395049 2025-12-04T17:18:03.9866388Z Artifact test-reports-runattempt1-test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223.zip has been successfully uploaded! Final size is 63921 bytes. Artifact ID is 4767395049 2025-12-04T17:18:03.9870058Z Artifact download URL: https://github.com/pytorch/pytorch/actions/runs/19921726347/artifacts/4767395049 2025-12-04T17:18:04.0002573Z ##[group]Run actions/upload-artifact@v4 2025-12-04T17:18:04.0002753Z with: 2025-12-04T17:18:04.0002967Z name: logs-runattempt1-test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223.zip 2025-12-04T17:18:04.0003192Z retention-days: 14 2025-12-04T17:18:04.0003321Z if-no-files-found: ignore 2025-12-04T17:18:04.0003461Z path: usage_log.txt test/**/*.log 2025-12-04T17:18:04.0003612Z compression-level: 6 2025-12-04T17:18:04.0003736Z overwrite: false 2025-12-04T17:18:04.0003858Z include-hidden-files: false 2025-12-04T17:18:04.0003984Z env: 2025-12-04T17:18:04.0004095Z GIT_DEFAULT_BRANCH: main 2025-12-04T17:18:04.0004244Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T17:18:04.0004607Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T17:18:04.0004795Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T17:18:04.0005325Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device 
/dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T17:18:04.0005848Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T17:18:04.0005987Z AWS_REGION: us-east-1 2025-12-04T17:18:04.0006183Z AWS_ACCESS_KEY_ID: *** 2025-12-04T17:18:04.0006354Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T17:18:04.0008530Z AWS_SESSION_TOKEN: *** 2025-12-04T17:18:04.0008720Z CONTAINER_NAME: d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T17:18:04.0008924Z ##[endgroup] 2025-12-04T17:18:04.4543974Z Multiple search paths detected. Calculating the least common ancestor of all paths 2025-12-04T17:18:04.4544895Z The least common ancestor is /home/runner/_work/pytorch/pytorch. This will be the root directory of the artifact 2025-12-04T17:18:04.4545158Z With the provided path, there will be 40 files uploaded 2025-12-04T17:18:04.4550156Z Artifact name is valid! 2025-12-04T17:18:04.4550276Z Root directory input is valid! 2025-12-04T17:18:04.6842199Z Beginning upload of artifact content to blob storage 2025-12-04T17:18:05.6201294Z Uploaded bytes 1428959 2025-12-04T17:18:05.6921347Z Finished uploading artifact content to blob storage! 2025-12-04T17:18:05.6922492Z SHA256 digest of uploaded artifact zip is bbbafdb575e6bc689fada2f575ce74545b3a5978201f7aec3f748dea1943d6b6 2025-12-04T17:18:05.6923544Z Finalizing artifact upload 2025-12-04T17:18:05.8623032Z Artifact logs-runattempt1-test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223.zip.zip successfully finalized. Artifact ID 4767395416 2025-12-04T17:18:05.8624507Z Artifact logs-runattempt1-test-distributed-3-3-linux.rocm.gpu.gfx942.4.b_57113808223.zip has been successfully uploaded! Final size is 1428959 bytes. 
Artifact ID is 4767395416 2025-12-04T17:18:05.8628964Z Artifact download URL: https://github.com/pytorch/pytorch/actions/runs/19921726347/artifacts/4767395416 2025-12-04T17:18:05.8777684Z ##[group]Run # shellcheck disable=SC2156 2025-12-04T17:18:05.8777903Z # shellcheck disable=SC2156 2025-12-04T17:18:05.8778155Z find . -iname "core.[1-9]*" -exec docker exec "${CONTAINER_NAME}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \; 2025-12-04T17:18:05.8783089Z shell: /usr/bin/bash -e {0} 2025-12-04T17:18:05.8783227Z env: 2025-12-04T17:18:05.8783340Z GIT_DEFAULT_BRANCH: main 2025-12-04T17:18:05.8783505Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T17:18:05.8783703Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T17:18:05.8783893Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T17:18:05.8784452Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T17:18:05.8784974Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T17:18:05.8785111Z AWS_REGION: us-east-1 2025-12-04T17:18:05.8785306Z AWS_ACCESS_KEY_ID: *** 2025-12-04T17:18:05.8785486Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T17:18:05.8787637Z AWS_SESSION_TOKEN: *** 2025-12-04T17:18:05.8787827Z CONTAINER_NAME: d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T17:18:05.8788025Z ##[endgroup] 2025-12-04T17:18:06.0162882Z ##[group]Run actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 2025-12-04T17:18:06.0163086Z with: 2025-12-04T17:18:06.0163239Z name: coredumps-distributed-3-3-linux.rocm.gpu.gfx942.4.b 2025-12-04T17:18:06.0163409Z retention-days: 14 2025-12-04T17:18:06.0163527Z if-no-files-found: ignore 2025-12-04T17:18:06.0163650Z path: ./**/core.[1-9]* 
2025-12-04T17:18:06.0163771Z compression-level: 6 2025-12-04T17:18:06.0163885Z overwrite: false 2025-12-04T17:18:06.0163997Z include-hidden-files: false 2025-12-04T17:18:06.0164131Z env: 2025-12-04T17:18:06.0164240Z GIT_DEFAULT_BRANCH: main 2025-12-04T17:18:06.0164389Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T17:18:06.0164584Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T17:18:06.0164758Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T17:18:06.0165308Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD128 --device /dev/dri/renderD136 --device /dev/dri/renderD144 --device /dev/dri/renderD152 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T17:18:06.0165817Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T17:18:06.0165946Z AWS_REGION: us-east-1 2025-12-04T17:18:06.0166136Z AWS_ACCESS_KEY_ID: *** 2025-12-04T17:18:06.0166302Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T17:18:06.0168472Z AWS_SESSION_TOKEN: *** 2025-12-04T17:18:06.0168658Z CONTAINER_NAME: d3f2328cea4b95bb970a3d395bda82ae307de4cda5c963d880ae80256e345be3 2025-12-04T17:18:06.0168930Z ##[endgroup] 2025-12-04T17:18:10.3148283Z No files were found with the provided path: ./**/core.[1-9]*. No artifacts will be uploaded. 2025-12-04T17:18:10.3356205Z Post job cleanup. 2025-12-04T17:18:10.3369789Z Post job cleanup. 2025-12-04T17:18:10.3576461Z Logging out of registry 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T17:18:10.4468202Z Post job cleanup. 2025-12-04T17:18:10.5316155Z Post job cleanup. 2025-12-04T17:18:10.5336236Z Post job cleanup. 
2025-12-04T17:18:10.5878054Z [command]/usr/bin/git version 2025-12-04T17:18:10.5914929Z git version 2.52.0 2025-12-04T17:18:10.5936694Z Copying '/home/runner/.gitconfig' to '/home/runner/_work/_temp/01a340bf-886e-4548-b250-7d388a4e5d25/.gitconfig' 2025-12-04T17:18:10.5942670Z Temporarily overriding HOME='/home/runner/_work/_temp/01a340bf-886e-4548-b250-7d388a4e5d25' before making global git config changes 2025-12-04T17:18:10.5943160Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T17:18:10.5955629Z [command]/usr/bin/git config --global --add safe.directory /home/runner/_work/pytorch/pytorch 2025-12-04T17:18:10.6008125Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T17:18:10.6025366Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T17:18:10.6236120Z Entering 'android/libs/fbjni' 2025-12-04T17:18:10.6275257Z Entering 'third_party/FP16' 2025-12-04T17:18:10.6317600Z Entering 'third_party/FXdiv' 2025-12-04T17:18:10.6354725Z Entering 'third_party/NNPACK' 2025-12-04T17:18:10.6393801Z Entering 'third_party/NVTX' 2025-12-04T17:18:10.6423195Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T17:18:10.6459075Z Entering 'third_party/XNNPACK' 2025-12-04T17:18:10.6500585Z Entering 'third_party/aiter' 2025-12-04T17:18:10.6532828Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T17:18:10.6565619Z Entering 'third_party/benchmark' 2025-12-04T17:18:10.6598072Z Entering 'third_party/composable_kernel' 2025-12-04T17:18:10.6628484Z Entering 'third_party/cpp-httplib' 2025-12-04T17:18:10.6656850Z Entering 'third_party/cpuinfo' 2025-12-04T17:18:10.6700552Z Entering 'third_party/cudnn_frontend' 2025-12-04T17:18:10.6723441Z Entering 'third_party/cutlass' 2025-12-04T17:18:10.6750072Z Entering 'third_party/fbgemm' 2025-12-04T17:18:10.6778431Z 
Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T17:18:10.6798051Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T17:18:10.6829255Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T17:18:10.6849224Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T17:18:10.6875818Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T17:18:10.6900941Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T17:18:10.6924792Z Entering 'third_party/fbgemm/external/json' 2025-12-04T17:18:10.6953715Z Entering 'third_party/flash-attention' 2025-12-04T17:18:10.6977936Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T17:18:10.7000144Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T17:18:10.7023152Z Entering 'third_party/flatbuffers' 2025-12-04T17:18:10.7045276Z Entering 'third_party/fmt' 2025-12-04T17:18:10.7066679Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T17:18:10.7110916Z Entering 'third_party/gloo' 2025-12-04T17:18:10.7136563Z Entering 'third_party/googletest' 2025-12-04T17:18:10.7158706Z Entering 'third_party/ideep' 2025-12-04T17:18:10.7183312Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T17:18:10.7215241Z Entering 'third_party/ittapi' 2025-12-04T17:18:10.7240290Z Entering 'third_party/kineto' 2025-12-04T17:18:10.7278417Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T17:18:10.7313421Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T17:18:10.7342565Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T17:18:10.7366249Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T17:18:10.7397161Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T17:18:10.7425140Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T17:18:10.7462916Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T17:18:10.7488170Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T17:18:10.7526815Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T17:18:10.7562232Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T17:18:10.7599975Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T17:18:10.7623036Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:10.7655144Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:10.7681517Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T17:18:10.7702253Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T17:18:10.7725609Z Entering 'third_party/kleidiai' 2025-12-04T17:18:10.7756530Z Entering 'third_party/mimalloc' 2025-12-04T17:18:10.7781536Z Entering 'third_party/nlohmann' 2025-12-04T17:18:10.7805653Z Entering 'third_party/onnx' 2025-12-04T17:18:10.7837224Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T17:18:10.7860581Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T17:18:10.7885472Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T17:18:10.7910603Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T17:18:10.7930462Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T17:18:10.7950239Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T17:18:10.7975256Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T17:18:10.8011548Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T17:18:10.8034038Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T17:18:10.8057686Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:10.8086771Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:10.8111603Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T17:18:10.8146592Z Entering 'third_party/pocketfft' 2025-12-04T17:18:10.8175191Z Entering 'third_party/protobuf' 2025-12-04T17:18:10.8230078Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T17:18:10.8254123Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T17:18:10.8277800Z Entering 'third_party/psimd' 2025-12-04T17:18:10.8309598Z Entering 'third_party/pthreadpool' 2025-12-04T17:18:10.8339727Z Entering 'third_party/pybind11' 2025-12-04T17:18:10.8362020Z Entering 'third_party/python-peachpy' 2025-12-04T17:18:10.8383658Z Entering 'third_party/sleef' 2025-12-04T17:18:10.8422554Z Entering 'third_party/tensorpipe' 2025-12-04T17:18:10.8452570Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T17:18:10.8482596Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T17:18:10.8503426Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T17:18:10.8523870Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T17:18:10.8543796Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T17:18:10.8593403Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T17:18:10.8610432Z http.https://github.com/.extraheader 2025-12-04T17:18:10.8622768Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-12-04T17:18:10.8649441Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 
'http.https://github.com/.extraheader' || :" 2025-12-04T17:18:10.8883272Z Entering 'android/libs/fbjni' 2025-12-04T17:18:10.8901901Z http.https://github.com/.extraheader 2025-12-04T17:18:10.8922699Z Entering 'third_party/FP16' 2025-12-04T17:18:10.8940194Z http.https://github.com/.extraheader 2025-12-04T17:18:10.8964586Z Entering 'third_party/FXdiv' 2025-12-04T17:18:10.8980763Z http.https://github.com/.extraheader 2025-12-04T17:18:10.8997110Z Entering 'third_party/NNPACK' 2025-12-04T17:18:10.9009811Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9027148Z Entering 'third_party/NVTX' 2025-12-04T17:18:10.9039472Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9055138Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T17:18:10.9072743Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9088895Z Entering 'third_party/XNNPACK' 2025-12-04T17:18:10.9102305Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9125606Z Entering 'third_party/aiter' 2025-12-04T17:18:10.9139503Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9157004Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T17:18:10.9169620Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9190288Z Entering 'third_party/benchmark' 2025-12-04T17:18:10.9217817Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9235903Z Entering 'third_party/composable_kernel' 2025-12-04T17:18:10.9255868Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9286495Z Entering 'third_party/cpp-httplib' 2025-12-04T17:18:10.9303253Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9326112Z Entering 'third_party/cpuinfo' 2025-12-04T17:18:10.9345969Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9370428Z Entering 'third_party/cudnn_frontend' 2025-12-04T17:18:10.9391625Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9411791Z Entering 'third_party/cutlass' 2025-12-04T17:18:10.9427742Z http.https://github.com/.extraheader 
2025-12-04T17:18:10.9448650Z Entering 'third_party/fbgemm' 2025-12-04T17:18:10.9461729Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9480131Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T17:18:10.9493945Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9511893Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T17:18:10.9532886Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9554013Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T17:18:10.9569865Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9587728Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T17:18:10.9601768Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9636867Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T17:18:10.9650435Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9672395Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T17:18:10.9685323Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9701278Z Entering 'third_party/fbgemm/external/json' 2025-12-04T17:18:10.9714252Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9733050Z Entering 'third_party/flash-attention' 2025-12-04T17:18:10.9745435Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9764648Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T17:18:10.9777455Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9797159Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T17:18:10.9811035Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9842316Z Entering 'third_party/flatbuffers' 2025-12-04T17:18:10.9855074Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9874494Z Entering 'third_party/fmt' 2025-12-04T17:18:10.9890018Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9906675Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T17:18:10.9918577Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9934523Z Entering 
'third_party/gloo' 2025-12-04T17:18:10.9948245Z http.https://github.com/.extraheader 2025-12-04T17:18:10.9971089Z Entering 'third_party/googletest' 2025-12-04T17:18:10.9987883Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0004680Z Entering 'third_party/ideep' 2025-12-04T17:18:11.0025087Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0044144Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T17:18:11.0057014Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0077846Z Entering 'third_party/ittapi' 2025-12-04T17:18:11.0095801Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0113071Z Entering 'third_party/kineto' 2025-12-04T17:18:11.0129042Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0146761Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T17:18:11.0161490Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0181861Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T17:18:11.0199729Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0219038Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T17:18:11.0232760Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0252984Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T17:18:11.0265448Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0283500Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T17:18:11.0294763Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0318005Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T17:18:11.0331709Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0350131Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T17:18:11.0362454Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0379794Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T17:18:11.0391437Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0407677Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T17:18:11.0431264Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0450752Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T17:18:11.0465222Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0484247Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T17:18:11.0495890Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0512026Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:11.0524470Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0541658Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:11.0555461Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0575564Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T17:18:11.0588102Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0605229Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T17:18:11.0620626Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0641604Z Entering 'third_party/kleidiai' 2025-12-04T17:18:11.0663787Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0692805Z Entering 'third_party/mimalloc' 2025-12-04T17:18:11.0719798Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0742383Z Entering 'third_party/nlohmann' 2025-12-04T17:18:11.0761810Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0785800Z Entering 'third_party/onnx' 2025-12-04T17:18:11.0804411Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0839309Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T17:18:11.0862234Z 
http.https://github.com/.extraheader 2025-12-04T17:18:11.0886641Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T17:18:11.0922725Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0949450Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T17:18:11.0966920Z http.https://github.com/.extraheader 2025-12-04T17:18:11.0992271Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T17:18:11.1007745Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1027215Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T17:18:11.1045036Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1062264Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T17:18:11.1077431Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1095774Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T17:18:11.1108623Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1127372Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T17:18:11.1142254Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1159530Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T17:18:11.1171461Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1190206Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:11.1203266Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1223468Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:11.1243330Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1265931Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T17:18:11.1282034Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1312278Z Entering 'third_party/pocketfft' 2025-12-04T17:18:11.1331192Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1346726Z Entering 
'third_party/protobuf' 2025-12-04T17:18:11.1359154Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1377763Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T17:18:11.1391030Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1409081Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T17:18:11.1422410Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1443179Z Entering 'third_party/psimd' 2025-12-04T17:18:11.1457635Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1472990Z Entering 'third_party/pthreadpool' 2025-12-04T17:18:11.1490938Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1507984Z Entering 'third_party/pybind11' 2025-12-04T17:18:11.1521149Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1542998Z Entering 'third_party/python-peachpy' 2025-12-04T17:18:11.1556222Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1572630Z Entering 'third_party/sleef' 2025-12-04T17:18:11.1586184Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1604669Z Entering 'third_party/tensorpipe' 2025-12-04T17:18:11.1620216Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1638220Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T17:18:11.1650497Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1666658Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T17:18:11.1684418Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1701281Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T17:18:11.1713584Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1728971Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T17:18:11.1740116Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1756759Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T17:18:11.1769735Z http.https://github.com/.extraheader 2025-12-04T17:18:11.1826387Z [command]/usr/bin/git config --local --name-only 
--get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.1850155Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T17:18:11.2238864Z Entering 'android/libs/fbjni' 2025-12-04T17:18:11.2258343Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T17:18:11.2271300Z Entering 'third_party/FP16' 2025-12-04T17:18:11.2286465Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T17:18:11.2299496Z Entering 'third_party/FXdiv' 2025-12-04T17:18:11.2313601Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T17:18:11.2322464Z Entering 'third_party/NNPACK' 2025-12-04T17:18:11.2333724Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T17:18:11.2343690Z Entering 'third_party/NVTX' 2025-12-04T17:18:11.2355044Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T17:18:11.2364341Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T17:18:11.2374192Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T17:18:11.2382778Z Entering 'third_party/XNNPACK' 2025-12-04T17:18:11.2393626Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T17:18:11.2412703Z Entering 'third_party/aiter' 2025-12-04T17:18:11.2424779Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T17:18:11.2433993Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T17:18:11.2446219Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T17:18:11.2462253Z Entering 
'third_party/benchmark' 2025-12-04T17:18:11.2474171Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T17:18:11.2485017Z Entering 'third_party/composable_kernel' 2025-12-04T17:18:11.2495462Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T17:18:11.2509536Z Entering 'third_party/cpp-httplib' 2025-12-04T17:18:11.2519812Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T17:18:11.2529890Z Entering 'third_party/cpuinfo' 2025-12-04T17:18:11.2539552Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T17:18:11.2549406Z Entering 'third_party/cudnn_frontend' 2025-12-04T17:18:11.2560607Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T17:18:11.2569665Z Entering 'third_party/cutlass' 2025-12-04T17:18:11.2581767Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T17:18:11.2596755Z Entering 'third_party/fbgemm' 2025-12-04T17:18:11.2608981Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T17:18:11.2619434Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T17:18:11.2628896Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T17:18:11.2638463Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T17:18:11.2648981Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T17:18:11.2661518Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T17:18:11.2670741Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config 
remote.origin.url 2025-12-04T17:18:11.2679202Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T17:18:11.2688939Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T17:18:11.2703337Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T17:18:11.2714497Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T17:18:11.2724093Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T17:18:11.2734647Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T17:18:11.2743280Z Entering 'third_party/fbgemm/external/json' 2025-12-04T17:18:11.2752534Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T17:18:11.2770437Z Entering 'third_party/flash-attention' 2025-12-04T17:18:11.2781571Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T17:18:11.2791330Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T17:18:11.2802037Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T17:18:11.2816273Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T17:18:11.2826583Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T17:18:11.2844901Z Entering 'third_party/flatbuffers' 2025-12-04T17:18:11.2860395Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T17:18:11.2870246Z Entering 'third_party/fmt' 2025-12-04T17:18:11.2881446Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config 
remote.origin.url 2025-12-04T17:18:11.2891089Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T17:18:11.2902371Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T17:18:11.2924284Z Entering 'third_party/gloo' 2025-12-04T17:18:11.2925619Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T17:18:11.2931707Z Entering 'third_party/googletest' 2025-12-04T17:18:11.2942844Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:11.2955349Z Entering 'third_party/ideep' 2025-12-04T17:18:11.2965335Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T17:18:11.2974401Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T17:18:11.2983824Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T17:18:11.3002460Z Entering 'third_party/ittapi' 2025-12-04T17:18:11.3012056Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T17:18:11.3021487Z Entering 'third_party/kineto' 2025-12-04T17:18:11.3031835Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T17:18:11.3040879Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T17:18:11.3050582Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T17:18:11.3068337Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T17:18:11.3078300Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T17:18:11.3092783Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T17:18:11.3102757Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T17:18:11.3111015Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T17:18:11.3120351Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T17:18:11.3128890Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T17:18:11.3137880Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T17:18:11.3146409Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T17:18:11.3171281Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T17:18:11.3183840Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T17:18:11.3197334Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T17:18:11.3206246Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T17:18:11.3215362Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:11.3227811Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T17:18:11.3237765Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T17:18:11.3252382Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T17:18:11.3263021Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T17:18:11.3272456Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T17:18:11.3282998Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T17:18:11.3293270Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:11.3304167Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T17:18:11.3314514Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:11.3324603Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T17:18:11.3336817Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T17:18:11.3361499Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T17:18:11.3369928Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T17:18:11.3382179Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 
2025-12-04T17:18:11.3392385Z Entering 'third_party/kleidiai' 2025-12-04T17:18:11.3402581Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T17:18:11.3412386Z Entering 'third_party/mimalloc' 2025-12-04T17:18:11.3422607Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T17:18:11.3432093Z Entering 'third_party/nlohmann' 2025-12-04T17:18:11.3442125Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T17:18:11.3451841Z Entering 'third_party/onnx' 2025-12-04T17:18:11.3461176Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T17:18:11.3477463Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T17:18:11.3487147Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T17:18:11.3499116Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T17:18:11.3509819Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T17:18:11.3519491Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T17:18:11.3531764Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T17:18:11.3546691Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T17:18:11.3557043Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:11.3565573Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T17:18:11.3579227Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T17:18:11.3587356Z Entering 
'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T17:18:11.3596813Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T17:18:11.3611439Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T17:18:11.3625011Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T17:18:11.3636769Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T17:18:11.3649902Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T17:18:11.3658498Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T17:18:11.3667777Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T17:18:11.3676706Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:11.3687966Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T17:18:11.3697616Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:11.3708256Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T17:18:11.3718904Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T17:18:11.3728160Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T17:18:11.3747018Z Entering 'third_party/pocketfft' 2025-12-04T17:18:11.3756597Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T17:18:11.3765790Z Entering 'third_party/protobuf' 2025-12-04T17:18:11.3779790Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T17:18:11.3793192Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T17:18:11.3802544Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T17:18:11.3811438Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T17:18:11.3821326Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:11.3832727Z Entering 'third_party/psimd' 2025-12-04T17:18:11.3842854Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T17:18:11.3851704Z Entering 'third_party/pthreadpool' 2025-12-04T17:18:11.3861800Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T17:18:11.3872291Z Entering 'third_party/pybind11' 2025-12-04T17:18:11.3884305Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T17:18:11.3893648Z Entering 'third_party/python-peachpy' 2025-12-04T17:18:11.3903048Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T17:18:11.3911800Z Entering 'third_party/sleef' 2025-12-04T17:18:11.3920884Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T17:18:11.3930061Z Entering 'third_party/tensorpipe' 2025-12-04T17:18:11.3939970Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T17:18:11.3948886Z Entering 
'third_party/tensorpipe/third_party/googletest' 2025-12-04T17:18:11.3961732Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:11.3973189Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T17:18:11.3984572Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T17:18:11.3993755Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T17:18:11.4003764Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T17:18:11.4012495Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T17:18:11.4022127Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T17:18:11.4031000Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T17:18:11.4051654Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T17:18:11.4084825Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4105065Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4121655Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4138454Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4155578Z 
[command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4177960Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4195208Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4210640Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4225566Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4240443Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4256533Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4272594Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4288470Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4308726Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4323911Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4340537Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4355157Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4369292Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4394191Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4410121Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4424720Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4444218Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4459028Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4486154Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T17:18:11.4509517Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4531875Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4557678Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4579886Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4599742Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4616400Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4635561Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4652173Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4667828Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4683877Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4700622Z [command]/usr/bin/git config 
--file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4716774Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4733007Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4759277Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4775236Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4792905Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4808823Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4841717Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4860551Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4878250Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4898807Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4922605Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4941413Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4966357Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4983162Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.4999950Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5016352Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5032508Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5050180Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5067098Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5082835Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5100108Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5116597Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5134118Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5152017Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5170319Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T17:18:11.5186284Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5202201Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5227429Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5256906Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5274768Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5296629Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5313615Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5330814Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5349315Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T17:18:11.5369542Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5386580Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5406930Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5424497Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5444940Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5461203Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5479019Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5496174Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5514050Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5532339Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp 
^includeIf\.gitdir: 2025-12-04T17:18:11.5552349Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5569851Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:11.5687765Z Post job cleanup. 2025-12-04T17:18:11.6201946Z [command]/usr/bin/git version 2025-12-04T17:18:11.6228042Z git version 2.52.0 2025-12-04T17:18:11.6243744Z Copying '/home/runner/.gitconfig' to '/home/runner/_work/_temp/9757edb4-c400-4279-81c9-0a247fd1ab27/.gitconfig' 2025-12-04T17:18:11.6248813Z Temporarily overriding HOME='/home/runner/_work/_temp/9757edb4-c400-4279-81c9-0a247fd1ab27' before making global git config changes 2025-12-04T17:18:11.6249166Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T17:18:11.6250999Z [command]/usr/bin/git config --global --add safe.directory /home/runner/_work/pytorch/pytorch 2025-12-04T17:18:11.6271505Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T17:18:11.6288150Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T17:18:11.6521016Z Entering 'android/libs/fbjni' 2025-12-04T17:18:11.6558927Z Entering 'third_party/FP16' 2025-12-04T17:18:11.6591937Z Entering 'third_party/FXdiv' 2025-12-04T17:18:11.6612985Z Entering 'third_party/NNPACK' 2025-12-04T17:18:11.6637653Z Entering 'third_party/NVTX' 2025-12-04T17:18:11.6663425Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T17:18:11.6687291Z Entering 'third_party/XNNPACK' 2025-12-04T17:18:11.6723637Z Entering 'third_party/aiter' 2025-12-04T17:18:11.6755963Z 
Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T17:18:11.6801786Z Entering 'third_party/benchmark' 2025-12-04T17:18:11.6830659Z Entering 'third_party/composable_kernel' 2025-12-04T17:18:11.6856169Z Entering 'third_party/cpp-httplib' 2025-12-04T17:18:11.6880119Z Entering 'third_party/cpuinfo' 2025-12-04T17:18:11.6906011Z Entering 'third_party/cudnn_frontend' 2025-12-04T17:18:11.6929083Z Entering 'third_party/cutlass' 2025-12-04T17:18:11.6954702Z Entering 'third_party/fbgemm' 2025-12-04T17:18:11.7003950Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T17:18:11.7036024Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T17:18:11.7073546Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T17:18:11.7105993Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T17:18:11.7139298Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T17:18:11.7166367Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T17:18:11.7194328Z Entering 'third_party/fbgemm/external/json' 2025-12-04T17:18:11.7239187Z Entering 'third_party/flash-attention' 2025-12-04T17:18:11.7261923Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T17:18:11.7296815Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T17:18:11.7326629Z Entering 'third_party/flatbuffers' 2025-12-04T17:18:11.7348409Z Entering 'third_party/fmt' 2025-12-04T17:18:11.7368804Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T17:18:11.7389165Z Entering 'third_party/gloo' 2025-12-04T17:18:11.7408990Z Entering 'third_party/googletest' 2025-12-04T17:18:11.7431472Z Entering 'third_party/ideep' 2025-12-04T17:18:11.7455135Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T17:18:11.7490170Z Entering 'third_party/ittapi' 2025-12-04T17:18:11.7517378Z Entering 'third_party/kineto' 2025-12-04T17:18:11.7545469Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T17:18:11.7578981Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T17:18:11.7605003Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T17:18:11.7632409Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T17:18:11.7670530Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T17:18:11.7693755Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T17:18:11.7725538Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T17:18:11.7747010Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T17:18:11.7773085Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T17:18:11.7797186Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T17:18:11.7819671Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T17:18:11.7844629Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:11.7871768Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:11.7915842Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T17:18:11.7939097Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T17:18:11.7965671Z Entering 'third_party/kleidiai' 2025-12-04T17:18:11.7991418Z Entering 'third_party/mimalloc' 2025-12-04T17:18:11.8018098Z Entering 'third_party/nlohmann' 2025-12-04T17:18:11.8042470Z Entering 'third_party/onnx' 2025-12-04T17:18:11.8070218Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T17:18:11.8098234Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T17:18:11.8130704Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T17:18:11.8153481Z 
Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T17:18:11.8182997Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T17:18:11.8204805Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T17:18:11.8228005Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T17:18:11.8250398Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T17:18:11.8284542Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T17:18:11.8312671Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:11.8340858Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:11.8366267Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T17:18:11.8399071Z Entering 'third_party/pocketfft' 2025-12-04T17:18:11.8430448Z Entering 'third_party/protobuf' 2025-12-04T17:18:11.8462055Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T17:18:11.8488894Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T17:18:11.8512907Z Entering 'third_party/psimd' 2025-12-04T17:18:11.8535455Z Entering 'third_party/pthreadpool' 2025-12-04T17:18:11.8563099Z Entering 'third_party/pybind11' 2025-12-04T17:18:11.8584919Z Entering 'third_party/python-peachpy' 2025-12-04T17:18:11.8605477Z Entering 'third_party/sleef' 2025-12-04T17:18:11.8627912Z Entering 'third_party/tensorpipe' 2025-12-04T17:18:11.8660101Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T17:18:11.8683795Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T17:18:11.8706636Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T17:18:11.8735859Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T17:18:11.8759734Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T17:18:11.8805189Z 
[command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T17:18:11.8825915Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-12-04T17:18:11.9006075Z Entering 'android/libs/fbjni' 2025-12-04T17:18:11.9029288Z Entering 'third_party/FP16' 2025-12-04T17:18:11.9060190Z Entering 'third_party/FXdiv' 2025-12-04T17:18:11.9086752Z Entering 'third_party/NNPACK' 2025-12-04T17:18:11.9107568Z Entering 'third_party/NVTX' 2025-12-04T17:18:11.9137262Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T17:18:11.9163195Z Entering 'third_party/XNNPACK' 2025-12-04T17:18:11.9193607Z Entering 'third_party/aiter' 2025-12-04T17:18:11.9218075Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T17:18:11.9263995Z Entering 'third_party/benchmark' 2025-12-04T17:18:11.9289074Z Entering 'third_party/composable_kernel' 2025-12-04T17:18:11.9333306Z Entering 'third_party/cpp-httplib' 2025-12-04T17:18:11.9362612Z Entering 'third_party/cpuinfo' 2025-12-04T17:18:11.9397075Z Entering 'third_party/cudnn_frontend' 2025-12-04T17:18:11.9418924Z Entering 'third_party/cutlass' 2025-12-04T17:18:11.9457462Z Entering 'third_party/fbgemm' 2025-12-04T17:18:11.9490954Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T17:18:11.9519028Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T17:18:11.9556224Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T17:18:11.9597934Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T17:18:11.9634675Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T17:18:11.9664246Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T17:18:11.9689610Z Entering 'third_party/fbgemm/external/json' 2025-12-04T17:18:11.9719295Z Entering 'third_party/flash-attention' 
2025-12-04T17:18:11.9764942Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T17:18:11.9795085Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T17:18:11.9843758Z Entering 'third_party/flatbuffers' 2025-12-04T17:18:11.9869476Z Entering 'third_party/fmt' 2025-12-04T17:18:11.9892336Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T17:18:11.9913232Z Entering 'third_party/gloo' 2025-12-04T17:18:11.9933717Z Entering 'third_party/googletest' 2025-12-04T17:18:11.9960270Z Entering 'third_party/ideep' 2025-12-04T17:18:11.9993922Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T17:18:12.0030980Z Entering 'third_party/ittapi' 2025-12-04T17:18:12.0060128Z Entering 'third_party/kineto' 2025-12-04T17:18:12.0086727Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T17:18:12.0113994Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T17:18:12.0140562Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T17:18:12.0175663Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T17:18:12.0216987Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T17:18:12.0243322Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T17:18:12.0278479Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T17:18:12.0301676Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T17:18:12.0328323Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T17:18:12.0362672Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T17:18:12.0388965Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T17:18:12.0415735Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:12.0448405Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:12.0478363Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T17:18:12.0510810Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T17:18:12.0540479Z Entering 'third_party/kleidiai' 2025-12-04T17:18:12.0571712Z Entering 'third_party/mimalloc' 2025-12-04T17:18:12.0598357Z Entering 'third_party/nlohmann' 2025-12-04T17:18:12.0637136Z Entering 'third_party/onnx' 2025-12-04T17:18:12.0667375Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T17:18:12.0696101Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T17:18:12.0732049Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T17:18:12.0759000Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T17:18:12.0786041Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T17:18:12.0811727Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T17:18:12.0837989Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T17:18:12.0867186Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T17:18:12.0895652Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T17:18:12.0922227Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:12.0950543Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:12.0981006Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T17:18:12.1012813Z Entering 'third_party/pocketfft' 2025-12-04T17:18:12.1046202Z Entering 'third_party/protobuf' 2025-12-04T17:18:12.1072660Z Entering 'third_party/protobuf/third_party/benchmark' 
2025-12-04T17:18:12.1098481Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T17:18:12.1128501Z Entering 'third_party/psimd' 2025-12-04T17:18:12.1154908Z Entering 'third_party/pthreadpool' 2025-12-04T17:18:12.1185658Z Entering 'third_party/pybind11' 2025-12-04T17:18:12.1216509Z Entering 'third_party/python-peachpy' 2025-12-04T17:18:12.1241114Z Entering 'third_party/sleef' 2025-12-04T17:18:12.1264969Z Entering 'third_party/tensorpipe' 2025-12-04T17:18:12.1290959Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T17:18:12.1313574Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T17:18:12.1350985Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T17:18:12.1373604Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T17:18:12.1406215Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T17:18:12.1462887Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.1492108Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T17:18:12.1685897Z Entering 'android/libs/fbjni' 2025-12-04T17:18:12.1697039Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T17:18:12.1707314Z Entering 'third_party/FP16' 2025-12-04T17:18:12.1720237Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T17:18:12.1730279Z Entering 'third_party/FXdiv' 2025-12-04T17:18:12.1744976Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T17:18:12.1759413Z Entering 'third_party/NNPACK' 2025-12-04T17:18:12.1771667Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T17:18:12.1782036Z Entering 'third_party/NVTX' 2025-12-04T17:18:12.1794182Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T17:18:12.1803125Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T17:18:12.1814631Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T17:18:12.1824980Z Entering 'third_party/XNNPACK' 2025-12-04T17:18:12.1839292Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T17:18:12.1866559Z Entering 'third_party/aiter' 2025-12-04T17:18:12.1878770Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T17:18:12.1891191Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T17:18:12.1905954Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T17:18:12.1919375Z Entering 'third_party/benchmark' 2025-12-04T17:18:12.1930239Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T17:18:12.1942422Z Entering 'third_party/composable_kernel' 2025-12-04T17:18:12.1955258Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T17:18:12.1970725Z Entering 'third_party/cpp-httplib' 2025-12-04T17:18:12.1981897Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T17:18:12.1991805Z Entering 'third_party/cpuinfo' 2025-12-04T17:18:12.2006223Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T17:18:12.2016088Z Entering 'third_party/cudnn_frontend' 2025-12-04T17:18:12.2028290Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T17:18:12.2039682Z Entering 'third_party/cutlass' 2025-12-04T17:18:12.2050488Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T17:18:12.2062592Z Entering 'third_party/fbgemm' 2025-12-04T17:18:12.2074311Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T17:18:12.2082898Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T17:18:12.2095930Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T17:18:12.2109813Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T17:18:12.2122348Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T17:18:12.2139381Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T17:18:12.2154966Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T17:18:12.2165417Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T17:18:12.2177040Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T17:18:12.2197089Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T17:18:12.2216843Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T17:18:12.2226163Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T17:18:12.2238252Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T17:18:12.2253109Z Entering 'third_party/fbgemm/external/json' 2025-12-04T17:18:12.2268376Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T17:18:12.2283566Z Entering 'third_party/flash-attention' 2025-12-04T17:18:12.2300237Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T17:18:12.2310757Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T17:18:12.2329028Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T17:18:12.2341272Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T17:18:12.2354289Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T17:18:12.2367984Z Entering 'third_party/flatbuffers' 2025-12-04T17:18:12.2393356Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T17:18:12.2406574Z Entering 'third_party/fmt' 2025-12-04T17:18:12.2416597Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T17:18:12.2431562Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T17:18:12.2444115Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T17:18:12.2464028Z Entering 'third_party/gloo' 2025-12-04T17:18:12.2474857Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T17:18:12.2485257Z Entering 'third_party/googletest' 2025-12-04T17:18:12.2496074Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:12.2513466Z Entering 'third_party/ideep' 2025-12-04T17:18:12.2524654Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T17:18:12.2536301Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T17:18:12.2549885Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T17:18:12.2569288Z Entering 
'third_party/ittapi' 2025-12-04T17:18:12.2588359Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T17:18:12.2598029Z Entering 'third_party/kineto' 2025-12-04T17:18:12.2618253Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T17:18:12.2631383Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T17:18:12.2646803Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T17:18:12.2655814Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T17:18:12.2672429Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T17:18:12.2687452Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T17:18:12.2708639Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T17:18:12.2719293Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T17:18:12.2735334Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T17:18:12.2749399Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T17:18:12.2771074Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T17:18:12.2781042Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T17:18:12.2795648Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T17:18:12.2807792Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T17:18:12.2819972Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T17:18:12.2829254Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T17:18:12.2845813Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:12.2865070Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T17:18:12.2874804Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T17:18:12.2883443Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T17:18:12.2894385Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T17:18:12.2902849Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T17:18:12.2913403Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T17:18:12.2922171Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:12.2932656Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T17:18:12.2941816Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:12.2955484Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T17:18:12.2967585Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T17:18:12.2976284Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T17:18:12.2985053Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T17:18:12.2994268Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T17:18:12.3004610Z Entering 'third_party/kleidiai' 2025-12-04T17:18:12.3014861Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T17:18:12.3024514Z Entering 'third_party/mimalloc' 2025-12-04T17:18:12.3034287Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T17:18:12.3043985Z Entering 'third_party/nlohmann' 2025-12-04T17:18:12.3053833Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T17:18:12.3063678Z Entering 'third_party/onnx' 2025-12-04T17:18:12.3081362Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T17:18:12.3097748Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T17:18:12.3116205Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T17:18:12.3129656Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T17:18:12.3140135Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T17:18:12.3151833Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T17:18:12.3166575Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T17:18:12.3186604Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T17:18:12.3209658Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:12.3218719Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T17:18:12.3232267Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T17:18:12.3243739Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T17:18:12.3260827Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T17:18:12.3269867Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T17:18:12.3280573Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T17:18:12.3288920Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T17:18:12.3299854Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T17:18:12.3311868Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T17:18:12.3323898Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T17:18:12.3336543Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T17:18:12.3352021Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T17:18:12.3361958Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T17:18:12.3371412Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T17:18:12.3384746Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T17:18:12.3394432Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T17:18:12.3414308Z Entering 'third_party/pocketfft' 2025-12-04T17:18:12.3425196Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T17:18:12.3434560Z Entering 'third_party/protobuf' 2025-12-04T17:18:12.3444777Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T17:18:12.3455388Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T17:18:12.3465252Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T17:18:12.3473549Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T17:18:12.3482567Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:12.3493895Z Entering 
'third_party/psimd' 2025-12-04T17:18:12.3505228Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T17:18:12.3514983Z Entering 'third_party/pthreadpool' 2025-12-04T17:18:12.3524907Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T17:18:12.3534096Z Entering 'third_party/pybind11' 2025-12-04T17:18:12.3544240Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T17:18:12.3553513Z Entering 'third_party/python-peachpy' 2025-12-04T17:18:12.3564602Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T17:18:12.3582346Z Entering 'third_party/sleef' 2025-12-04T17:18:12.3596396Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T17:18:12.3622401Z Entering 'third_party/tensorpipe' 2025-12-04T17:18:12.3636546Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T17:18:12.3647138Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T17:18:12.3659276Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T17:18:12.3670334Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T17:18:12.3680652Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T17:18:12.3689637Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T17:18:12.3706502Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T17:18:12.3713950Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T17:18:12.3723347Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T17:18:12.3732302Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T17:18:12.3743550Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T17:18:12.3778301Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3809914Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3827997Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3844968Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3871269Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3895313Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3914049Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3930911Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3955247Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3973514Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.3991458Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4007467Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4027002Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4048997Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4065616Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4081202Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4096247Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4111489Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4128254Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4142204Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4158002Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4177795Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4196609Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4218231Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4234187Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4253096Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4268628Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4287036Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T17:18:12.4308003Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4335497Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4352502Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4379369Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4402179Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4426309Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4449329Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4465568Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4487228Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4513833Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only 
--get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4532879Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4553655Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4569626Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4584399Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4606707Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4625131Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4641192Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4655337Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4669836Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4685818Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4701703Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4720391Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4735244Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4751500Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4767438Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4782302Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4799670Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T17:18:12.4816081Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4833036Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4866617Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4881665Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4906121Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4923715Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4944952Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4963320Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.4983657Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5000122Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5015640Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5030833Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5047139Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5062857Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5078723Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5094610Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5120819Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5137665Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5158396Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5175251Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5206188Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5237887Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5252430Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5271457Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5287040Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5302707Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T17:18:12.5425329Z Cleaning up orphan processes 2025-12-04T17:18:12.5510593Z Terminate orphan process: pid (17167) (docker)